[ROCm][MLA] FP8 ASM prefill on gfx950 for AITER MLA backend #42294
maeehart wants to merge 4 commits
Conversation
Force-pushed from 4d683f7 to e3c5d80
Code Review
This pull request introduces FP8 MLA prefill support for ROCm backends using persistent-scheduling assembly kernels and refactors the sparse MLA implementation to inherit from a common base for better maintainability. Key optimizations include pre-allocating persistent metadata buffers and reducing memory overhead by using tighter bounds for logit allocations and releasing unused page indices. The review identified several critical issues: a missing head padding step in the sparse decode path, a TypeError caused by an incorrect number of arguments in a Triton utility call, and a NameError due to the use of an undefined block_size variable in the MQA logits operation.
Force-pushed from 6576cec to bd1e44d
… fix)

Two real bugs were causing the disables that landed in 8f33dc8 and c59b41c:

A) The persistent sparse decode kernel (mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps) was crashing because ROCMAiterMLASparseMetadataBuilder sized the persistent buffers via `get_mla_metadata_info_v1(..., is_sparse=True)` but populated them via `get_mla_metadata_v1(...)` *without* a `topk` kwarg. The populator defaults `topk=-1` (= dense layout), so the kernel subsequently read sparse-shaped buffers with dense-shaped strides and walked off the end of work_info_set / reduce_partial_map. Fix: pass `topk=self.topk_tokens` to `get_mla_metadata_v1` so the populator writes the matching sparse layout.

B) FP8 ASM prefill (mla_prefill_ps_asm_fwd) was crashing on chunked-prefill batches that pack >1 prefill segment into one forward (the live failure was 5815+4972+3592+2004+1 = 16384 tokens across 5 requests). The kernel only safely handles a single prefill segment per launch. Fix: re-enable the autodetected gfx950 path, but in forward_mha fall back to `flash_attn_varlen_func` whenever `query_start_loc.shape[0] - 1 > 1`. Single-request prefill (the common DSV3.2 1k/100/4 reference shape) keeps the FP8 ASM speed-up; multi-segment chunked-prefill batches stay safe.

With both fixes the sparse-MLA-on-AiterMLA architecture from vllm-project#42294 is intact: prefill flows through `forward_mha` with FP8 ASM, decode flows through `forward_mqa` with sparse top-k + persistent (work-stealing) metadata, matching the pre-vllm-project#42294 reference run that delivered ~227 tok/s on DSV3.2 ISL/OSL/MC=1000/100/4 TP4.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
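For reference, the fix-B guard reduces to counting packed prefill segments from `query_start_loc`; a minimal sketch (the standalone helper framing is illustrative, not the actual vLLM code):

```python
import torch

def num_prefill_segments(query_start_loc: torch.Tensor) -> int:
    # query_start_loc holds cumulative token offsets with a leading 0,
    # one boundary per packed segment, so segments = len - 1.
    return query_start_loc.shape[0] - 1

# Single segment -> FP8 ASM kernel; multiple segments -> flash_attn_varlen_func.
qsl = torch.tensor([0, 5815, 10787, 14379, 16383, 16384])
assert num_prefill_segments(qsl) == 5  # the live 16384-token failure: fall back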
… fix)

This commit re-enables both performance paths that PR vllm-project#42294 had to disable due to GPU memory access faults:

- Persistent (work-stealing) MLA decode metadata (mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps).
- FP8 ASM prefill via mla_prefill_ps_asm_fwd / mla_reduce_v1, autodetected on gfx950 when the AITER kernels are present.

The earlier faults were not in the kernels themselves; they were caused by three latent bugs in the rocm_aiter_mla_sparse.py refactor in this branch (e3c5d80), which only manifested under specific batch shapes and so were not caught by the long-prompt benchmark:

1. ROCMAiterMLASparseImpl.forward_mqa was reading attn_metadata.num_actual_tokens (= prefill+decode) and slicing topk_indices_buffer / sparse_req_id_per_token / sparse_paged_kv_* to that length. The MLA dispatcher in vllm/model_executor/layers/attention/mla_attention.py only passes mqa_q = q[:num_decode_tokens] here, so the prefill positions were indexing into:
   - topk_indices_buffer[num_decode:num_actual], which the indexer skips for prefill (uninitialised memory), and
   - decode.block_table[req_id_per_token[prefill_pos]], where the prefill request id is >= num_decodes; decode.block_table only has rows for decode requests, so this was a straightforward OOB read.
   This commit slices everything in forward_mqa to q.shape[0] (= num_decode_tokens) and adds an early return for empty decode batches.

2. triton_convert_req_index_to_global_index writes its global KV indices RAGGED (at out_ptr + cu_seqlens[token_id] + indice_id), but the post-pass fetch_id_to_ragged_triton then read from the same scratch tensor ROW-MAJOR (at seq_id * topk + indice_id). These offsets only coincide when every sparse_seqlen == NUM_TOPK_TOKENS, which happens to be the case for long uniform prompts (the 1000/100/4 random benchmark hit sparse_seqlen=2048 on every row), but breaks for shorter or variable-length prompts (GSM8K, mixed prefill+decode batches with short contexts). The row-major read then pulled uninitialised values out of the empty_like(topk_indices) scratch buffer and forwarded them to mla_decode_fwd as KV indices - GPU memory access fault.
   This commit writes triton_convert_req_index_to_global_index output directly into attn_metadata.sparse_paged_kv_indices (which mla_decode_fwd consumes), removing the empty_like() scratch tensor and the fetch_id_to_ragged_triton repack pass. This is the flow PR vllm-project#41990 used and the one mla_decode_fwd was designed around.

3. The persistent populator (aiter.get_mla_metadata_v1) defaults topk=-1 (dense layout) but the buffers in ROCMAiterMLASparseMetadataBuilder were sized via get_mla_metadata_info_v1(..., is_sparse=True). The size/layout mismatch was the original kernel fault that PR vllm-project#42294 worked around. This commit:
   - Passes topk=self.topk_tokens explicitly to the populator so it writes the matching sparse layout.
   - Slices sparse_qo_indptr / sparse_paged_kv_indptr / sparse_paged_kv_last_page_len to num_decode_tokens before handing them to the populator (matching the slice forward_mqa performs above, and avoiding scheduling work for prefill positions whose KV indices are populated as -1 by the indexer).

In addition, mla_prefill_ps_asm_fwd faults on multi-segment varlen batches (e.g. 5 prefill segments packed into a single 16k-token forward by the chunked-prefill scheduler), so AiterMLAImpl.forward_mha keeps the FP8 ASM path on for the common single-segment prefill case and falls back to flash_attn_varlen_func when num_prefill_segments > 1.
Verified on MI355X (gfx950), DeepSeek-V3.2 TP4, with the exact reference YAML config (num_prompts=32, max_concurrency=4, random_input_len=1000, random_output_len=100, num_warmups=4):

- Server stable: bench (32/32, 0 failed), GSM8K via lm_eval (250 prompts at concurrency=4, 0 failed), repeated bench runs after GSM8K, all on the same server instance.
- Output throughput recovered from the PR vllm-project#42294 regression (~37 tok/s) to ~105 tok/s. The vllm-0415+ custom-container baseline (227 tok/s) is on a different build outside this branch.
- Direct sanity check: model produces correct GSM8K answers (e.g. Natalia 48 + 24 = "#### 72" exactly).

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
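To make bug 2 above concrete: the writer's ragged offsets and the reader's row-major offsets only coincide when every row is full. A small illustration in plain Python (the offset formulas are from the commit message; everything else is hypothetical):

```python
def ragged_offset(cu_seqlens: list[int], seq_id: int, indice_id: int) -> int:
    # Writer: out_ptr + cu_seqlens[token_id] + indice_id (ragged layout).
    return cu_seqlens[seq_id] + indice_id

def row_major_offset(topk: int, seq_id: int, indice_id: int) -> int:
    # Reader: seq_id * topk + indice_id (dense row-major layout).
    return seq_id * topk + indice_id

topk = 4
full_rows = [0, 4, 8]    # every sparse_seqlen == topk: the layouts coincide
short_rows = [0, 2, 5]   # variable-length rows: the layouts diverge
assert ragged_offset(full_rows, 1, 3) == row_major_offset(topk, 1, 3)   # 7 == 7
assert ragged_offset(short_rows, 1, 3) != row_major_offset(topk, 1, 3)  # 5 != 7
```

The second assert is exactly the GSM8K / short-context case: the row-major read lands past the ragged data and pulls uninitialised scratch values.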
Force-pushed from 78c33c1 to 5c473e1
ChuanLi1101 left a comment
Read it. Quick notes for tomorrow:
- Gemini #2 (`triton_convert` signature) and #3 (`block_size`) are stale; already fixed in 6289a4733b / fbbaeb5a99. Just resolve those threads.
- Gemini #1 (head padding) is real but inert for DSV3.2 (128 heads; even TP=16 gives 8 heads/rank, the only crash case, and nobody runs that today). Cheapest fix is to wrap the sparse `forward_mqa` with `AiterMLAHelper.get_mla_padded_q` / `get_mla_unpadded_o`, the same idiom as the dense `forward_mqa` you inherit from. Alternatively `assert num_heads >= 16` in `ROCMAiterMLASparseImpl.__init__` if you'd rather keep "DSA-only" explicit.
- CI red is the `pre-run-check` label gate, not a real lint failure. Needs a maintainer to add the `ready` label to start full CI. I'll ping in #vllm-rocm to get one of the AMD reviewers to flip it after you push the head-padding fix.
Other observations: the single-segment vs multi-segment dispatch in forward_mha is the right call (worth a one-liner in the description that mla_prefill_ps_asm_fwd faults on the chunked-prefill packed-segment layout — reviewers will ask). The _fp8_mla_prefill_supported() autodetection is clean. Bench is methodologically clean: same baseline f9f770ca0, identical 164,089 input tokens, both 32/32, so the +40.4% output / -49% TTFT / -18% TPOT decomposition into "FP8 ASM prefill" + "persistent decode metadata" is believable.
LGTM after the head-padding fix.
Wraps `ROCMAiterMLASparseImpl.forward_mqa` and `_forward_sparse_mla` with `AiterMLAHelper.get_mla_padded_q` / `get_mla_unpadded_o`, matching the dense `AiterMLAImpl.forward_mqa` we inherit from.

The AITER MLA decode kernel requires `num_heads >= 16` (the head-tile size). For configs with fewer heads, `get_mla_padded_q` repeat-interleaves along dim=1 up to the tile size, and `get_mla_unpadded_o` strides back down on the way out. For DSV3.2 with 128 heads both helpers are no-ops, so this change is inert in the production target; it just makes the sparse decode path safe in the same head-count regimes the dense path supports.

`_forward_sparse_mla` now sizes its output buffer to the actual (possibly padded) head count of `q` and returns the raw kernel output; the caller (`forward_mqa`) handles the unpadding. Also adds the padded-vs-cached head-count check to the buffer reuse guard so a config switch between padded and unpadded heads correctly reallocates `_decode_out`.

Addresses Gemini code-review comment on PR vllm-project#42294 (head-padding removed from sparse `forward_mqa`).

Co-authored-by: Cursor <cursoragent@cursor.com>
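A minimal sketch of the pad/unpad idiom the commit describes (the helper names are from the commit, but these bodies are an assumption, and they assume the tile size is a multiple of `num_heads`):

```python
import torch

HEAD_TILE = 16  # AITER MLA decode head-tile size

def pad_q(q: torch.Tensor) -> torch.Tensor:
    # q: [tokens, num_heads, head_dim]; repeat-interleave heads up to the tile.
    num_heads = q.shape[1]
    if num_heads >= HEAD_TILE:
        return q  # no-op for e.g. DSV3.2's 128 heads
    repeat = HEAD_TILE // num_heads  # assumes num_heads divides the tile size
    return q.repeat_interleave(repeat, dim=1)

def unpad_o(o: torch.Tensor, num_heads: int) -> torch.Tensor:
    # Stride back down to the original head count on the way out.
    if o.shape[1] == num_heads:
        return o
    return o[:, :: o.shape[1] // num_heads]

q = torch.randn(4, 8, 576)           # 8 heads < 16: padded up to the tile
o = pad_q(q)                          # [4, 16, 576]
assert unpad_o(o, 8).shape == q.shape
```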
Code-review feedback from @tjtanaa on PR vllm-project#42294, plus a small defensive guard around the deepgemm dispatch:

- ops/rocm_aiter_mla_sparse.py: restore `(_ON_GFX942 or _ON_GFX950)` in the `rocm_fp8_paged_mqa_logits` deepgemm dispatch. The deepgemm kernel uses MFMA shapes only validated on those archs (this is the upstream allowlist that vllm-project#42062 was about to extend to gfx950); rely on symbol presence + `block_size > 1` *on top of* the arch check rather than instead of it.
- backends/mla/rocm_aiter_mla.py: rename shorthand metadata-info unpacking variables (`wm_size, wm_dtype, ...`) to their full `work_metadata_size, work_metadata_dtype, ...` names; drop stale `VLLM_ROCM_FP8_MLA=1` references in two docstrings (the FP8 ASM prefill path is autodetected now, no env knob).
- backends/mla/rocm_aiter_mla_sparse.py: same rename in `_pre_alloc_persistent_metadata_buffers`; trim the verbose retrospective comments in `_forward_sparse_mla` / `_pre_alloc_persistent_metadata_buffers` / `forward_mqa` that documented past bugs that are now fixed.
- ops/rocm_aiter_mla_sparse.py: trim the verbose `if has_prefill:` skip comment to a one-liner; collapse the `attn_metadata.max_seq_len` discussion into a short TODO.

No behaviour change beyond restoring the arch allowlist on the deepgemm path.

Co-authored-by: Cursor <cursoragent@cursor.com>
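The restored dispatch condition amounts to layering capability checks on top of the arch allowlist; a hedged sketch (parameterised here for self-containment, whereas the real code reads module-level `_ON_GFX942` / `_ON_GFX950` flags):

```python
def use_deepgemm_mqa_logits(on_gfx942: bool, on_gfx950: bool,
                            kernel_present: bool, block_size: int) -> bool:
    # Arch allowlist first: the deepgemm kernel's MFMA shapes are only
    # validated on gfx942/gfx950. Symbol presence and block_size > 1 are
    # required on top of it, not instead of it.
    return (on_gfx942 or on_gfx950) and kernel_present and block_size > 1

assert use_deepgemm_mqa_logits(False, True, True, 64)
assert not use_deepgemm_mqa_logits(False, False, True, 64)  # off-allowlist arch
```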
Hi @maeehart, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
@maeehart given that you also updated dense MLA, please also evaluate whether deepseek-v3/r1 still works fine. Thanks. Please also address all those failures, e.g. pre-commits etc.
Closing this PR. The dense AITER MLA FP8 ASM prefill work is being pushed forward as a clean, narrowly-scoped follow-up in #42509.
Adds FP8 ASM prefill support to the dense AITER MLA backend on MI355X (gfx950), layered on top of upstream main (which already has the persistent-decode metadata + sparse-MLA refactor work merged via prior PRs; the previous sparse / ops changes from this branch are dropped).

- _fp8_mla_prefill_supported() autodetects FP8 ASM prefill when AITER ships mla_prefill_ps_asm_fwd + mla_reduce_v1 and the platform is gfx950; otherwise the backend silently falls back to flash_attn_varlen_func.
- forward_mha dispatches single-segment prefill batches through the FP8 ASM kernel and multi-segment varlen batches through flash_attn_varlen_func (the AITER ASM kernel faults on the multi-segment chunked-prefill layout, so the dispatcher is conservative).
- The metadata builder pre-allocates persistent-scheduling buffers sized for min(max_model_len, max_num_batched_tokens) and fills them per batch via get_ps_metadata_v1.

Verified on MI355X TP=4 with DeepSeek-V3.2 on the AMD nightly image amdsiloai/vllm-private:nightly_SHA7f65f84_42062_42072_41675_39177 (which already provides the equivalent persistent decode + sparse-MLA work): +6.6% output throughput, -9.9% mean TTFT on the 1000/100/4 PR-config bench with FP8 KV cache; neutral on the BF16 KV / multi-segment Frida bench (as designed, since the multi-segment path falls back). GSM8K 5-shot 250-sample accuracy: 92.0% / 90.8% strict-match for FP8 / BF16 KV cache respectively, both within ~1.5 sigma of baseline.

Co-authored-by: clintg6 <clint.greene@amd.com>
Co-authored-by: frida-andersson <frida.andersson@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from b725e1d to de84014
The previous squash accidentally dropped upstream main's persistent decode-metadata buffer init from AiterMLAMetadataBuilder.__init__ (self._num_attention_heads, self._mla_work_meta_data, _mla_work_indptr, _mla_work_info_set, _mla_reduce_indptr, _mla_reduce_final_map, _mla_reduce_partial_map) and replaced upstream's per-call o = torch.empty(...) decode-output allocation with a cached self._decode_out tensor.

Both changes are unrelated to FP8 prefill: the first crashed dense MLA decode (DeepSeek-V3 hit it via the max_qo_len == 1 path with AttributeError: 'AiterMLAMetadataBuilder' object has no attribute '_num_attention_heads'); the second was a stray decode-side optimization that doesn't belong in this PR.

Restore upstream main verbatim for both blocks. The PR is now strictly additive vs upstream/main on this file: 401 insertions, 0 deletions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @maeehart, the pre-commit checks have failed. Please run pre-commit as above, then commit the changes and push to your branch.
forward_mha had a leftover `if self._pad_v: output_prefill = output_prefill[..., : v.shape[-1]]` block. AiterMLAImpl never sets self._pad_v, so the FP8 prefill path crashed at runtime on the first forward (DeepSeek-V3 hit it during cudagraph capture with AttributeError: 'AiterMLAImpl' object has no attribute '_pad_v').

The slicing was a no-op anyway: _mla_fp8_prefill_attn allocates the output as [total_q, num_heads, v_head_dim], and v itself comes from kv_nope.split([..., self.v_head_dim]), so v.shape[-1] == v_head_dim by construction.

Drop the gate, replace with a comment explaining why no trim is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
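The no-op argument falls out of the shapes; a minimal sketch (the dimension values are illustrative, not the model's actual sizes):

```python
import torch

total_q, num_heads = 8, 16
qk_nope_head_dim, v_head_dim = 128, 128

# v comes from the kv_nope split, so its last dim is v_head_dim by construction.
kv_nope = torch.randn(total_q, num_heads, qk_nope_head_dim + v_head_dim)
k_nope, v = kv_nope.split([qk_nope_head_dim, v_head_dim], dim=-1)

# The prefill output is allocated with the same last dim...
output_prefill = torch.empty(total_q, num_heads, v_head_dim)
# ...so trimming to v.shape[-1] slices nothing: the gate was dead weight.
assert output_prefill[..., : v.shape[-1]].shape == output_prefill.shape
```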
Hi @maeehart, the pre-commit checks have failed. Please run pre-commit as above, then commit the changes and push to your branch.
The original FP8 ASM prefill path was guarded to single-segment-only because an earlier crash report flagged multi-segment varlen input as unstable. Re-tested on DSV3 / MI355X (gfx950) with chunked prefill enabled (PR-config 1000/100/4 and GSM8K 5-shot at MC=64), the kernel handles segments=1..7 cleanly and packs up to 8185 tokens per forward without faults. Removing the guard so prefill engages under real serving load.

Also replace the one-shot 'engaged' debug log with a per-unique-segment logger (capped to one entry per (rank, segment-count)) so future runs can verify which segment counts the dispatcher exercises.

Verified on DSV3:
- PR-config bench (1000/100/4): TTFT 357->304 ms (-15%), no errors
- GSM8K 5-shot MC=64: 94.5% (within 1 std err of varlen-only fallback)
- Segment distribution observed: 1, 2, 3, 4, 6, 7

Co-authored-by: Cursor <cursoragent@cursor.com>
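The capped logger the commit describes can be a simple seen-set keyed on (rank, segment-count); a sketch with hypothetical names, not the committed code:

```python
import logging

logger = logging.getLogger(__name__)
_seen_segment_counts: set[tuple[int, int]] = set()

def log_segment_count_once(rank: int, num_segments: int) -> None:
    # At most one log entry per (rank, segment-count) pair, so server logs
    # record which segment counts the FP8 ASM dispatcher actually exercises
    # without flooding on every forward.
    key = (rank, num_segments)
    if key not in _seen_segment_counts:
        _seen_segment_counts.add(key)
        logger.info("FP8 ASM prefill: %d segment(s) on rank %d",
                    num_segments, rank)
```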
Hi @maeehart, the pre-commit checks have failed. Please run pre-commit as above, then commit the changes and push to your branch.
Superseded by #42509 (clean, narrowly-scoped follow-up containing only the dense MLA FP8 ASM prefill changes).
Summary
Adds FP8 ASM prefill for the dense AITER MLA backend on DeepSeek-V3.2 on MI355X (gfx950), layered on top of the AITER MLA backend that's already in upstream main.
On MI355X TP=4 with FP8 KV cache, this lifts `vllm bench serve` (1000/100/4 ISL/OSL/MC) by +6.6 % output throughput and -9.9 % mean TTFT vs the AMD nightly image baseline that already includes the persistent decode + sparse-MLA refactor work.

Credits
AiterMLA-based inheritance) that this PR builds on top of.

What changed since the previous review round
The previous version of this PR touched three files (`rocm_aiter_mla.py`, `rocm_aiter_mla_sparse.py`, `vllm/v1/attention/ops/rocm_aiter_mla_sparse.py`) and bundled the FP8 ASM prefill with a sparse-MLA refactor and an FP8 paged-MQA-logits indexer fix. Since then the equivalents of the sparse refactor and the indexer fix are being worked on via separate PRs (#41675 and friends), so this branch has been rebased onto current `upstream/main` and scoped down to just the FP8 ASM prefill changes in `rocm_aiter_mla.py`.

Concretely:
- The branch is now `upstream/main` (HEAD 6ff7405b8) plus a single squashed commit applied on top.
- `rocm_aiter_mla_sparse.py`, `rocm_aiter_mla_sparse_dsv4.py`, and `vllm/v1/attention/ops/rocm_aiter_mla_sparse.py` are now identical to upstream main - no longer modified by this PR.
- The `tjtanaa` review feedback that applies to the dense file is still in place from the previous round (full descriptive variable names, no references to retired env vars like `VLLM_ROCM_FP8_MLA`, trimmed comments).
- The head-padding fix in `forward_mqa` for `num_heads < 16` (ChuanLi1101 #1 / Gemini #1) is moot here since this PR no longer changes `forward_mqa`.

What this PR changes
File: `vllm/v1/attention/backends/mla/rocm_aiter_mla.py`

- `_fp8_mla_prefill_supported()` autodetects FP8 ASM prefill on gfx950 when AITER ships `mla_prefill_ps_asm_fwd` + `mla_reduce_v1`. No env var needed; falls back silently to `flash_attn_varlen_func` when either is missing (see the sketch after this list).
- `forward_mha` dispatches single-segment prefill batches through the FP8 ASM kernel (`mla_prefill_ps_asm_fwd` + `mla_reduce_v1`) and falls back to `flash_attn_varlen_func` for multi-segment varlen batches. The AITER ASM kernel faults on the multi-segment chunked-prefill packed-segment layout (e.g. five 2-3k segments packed into one 16k-token forward by the chunked-prefill scheduler), so the dispatcher conservatively keeps the FP8 ASM path on for the common single-segment case only.
- The metadata builder pre-allocates persistent-scheduling buffers sized for `min(max_model_len, max_num_batched_tokens)` and fills them per batch via `aiter.get_ps_metadata_v1`.
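A hedged sketch of the autodetection probe in the first bullet (parameterised for self-containment; the real `_fp8_mla_prefill_supported()` reads the platform and the `aiter` module directly, and its exact guards may differ):

```python
def fp8_mla_prefill_supported(aiter_module, gcn_arch: str) -> bool:
    # Require gfx950 plus both AITER entry points; any miss means the
    # caller silently stays on the flash_attn_varlen_func path.
    return (
        gcn_arch == "gfx950"
        and hasattr(aiter_module, "mla_prefill_ps_asm_fwd")
        and hasattr(aiter_module, "mla_reduce_v1")
    )
```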
Verification

Hardware: MI355X (gfx950) TP=4, DeepSeek-V3.2.
Baseline image: `amdsiloai/vllm-private:nightly_SHA7f65f84_42062_42072_41675_39177`. This image already includes #42062, #42072, #41675, and #39177, so it represents what upstream main looks like for the persistent decode + sparse-MLA + indexer work. This PR's only delta on top of that baseline is the FP8 ASM prefill changes in `rocm_aiter_mla.py`, applied locally to the container's installed wheel for these measurements.

Server flags:
Env:
VLLM_ROCM_USE_AITER=1,VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1,VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4,HIP_VISIBLE_DEVICES=0,1,2,3.Performance
Two benchmark shapes were run on each of {baseline image, baseline + this PR} × {FP8 KV, BF16 KV} = 4 configurations.
PR-config bench (`vllm bench serve --dataset-name random --random-input-len 1000 --random-output-len 100 --num-prompts 32 --max-concurrency 4 --num-warmups 4 --ignore-eos --seed 1`):

[per-configuration results table lost in extraction]

The 1000/100/4 shape produces single-segment prefill batches and hits the FP8 ASM kernel. The FP8-KV gain (+6.6 % throughput, -9.9 % TTFT) is the headline. The BF16-KV TTFT/TPOT moves are within run-to-run variance for this small bench.
Longer context bench (`vllm bench serve --dataset-name random --random-input-len 5000 --random-output-len 500 --num-prompts 256 --max-concurrency 64 --num-warmups 64 --ignore-eos --seed 931`):

[per-configuration results table lost in extraction]

The 5000/500/64 shape (256 prompts, concurrency 64, ISL=5000) packs multiple varlen prefill segments per scheduler step to fill `--max-num-batched-tokens=16384`, so `forward_mha`'s `num_prefill_segments > 1` guard intentionally falls back to `flash_attn_varlen_func`. Performance is neutral (within ~0.6 %) - this is the intended design: the FP8 ASM kernel is bypassed for multi-segment batches because it faults on that layout. KV-cache utilization climbs to 95-99 % on both baseline and PR builds, consistent across configurations - this is admission limiting by `max-num-batched-tokens` (the engine schedules ~3 fresh 5k prefills per step into the 16k budget while 64 in-flight decodes accumulate KV pages), not a PR-introduced regression.
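As a worked example of the packing math above (numbers from this bench; a sketch, not scheduler code):

```python
max_num_batched_tokens = 16384  # scheduler token budget per step
isl = 5000                      # input length of each fresh prefill

# ~3 fresh 5k prefills fit the 16k budget each step; in-flight decodes
# (max_concurrency=64, one token each) take part of the remainder.
fresh_prefills = max_num_batched_tokens // isl
assert fresh_prefills == 3  # > 1 segment, so forward_mha falls back to varlen FA
```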
Accuracy

`lm_eval --model local-completions --tasks gsm8k --num_fewshot 5 --limit 250 --gen_kwargs 'temperature=0.0,max_gen_toks=512' --model_args 'model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:30003/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False,tokenizer_backend=None'`

[per-configuration accuracy table lost in extraction]

The ~2.4 pp drop from the FP8 ASM prefill kernel is within ~1.5σ of the 250-sample stderr (~1.6-1.8 pp) and is consistent across both KV-cache dtypes, indicating it comes from the bf16→fp8 cast inside the prefill kernel itself (Q, K, V are decompressed bf16 and cast to fp8 before the kernel; the kernel uses `one_scale = 1.0` for all three tensor scales rather than per-tensor max-abs scaling). All four configurations remain comfortably above 90 % strict-match. 250/250 requests completed, 0 failed, server stable across the full eval window in every configuration.
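The scaling tradeoff named above, unit scale vs per-tensor max-abs, can be sketched as follows. This illustrates the general technique, not the kernel's actual code, and assumes OCP e4m3 (finite max ≈ 448):

```python
import torch

FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn (OCP e4m3)

def cast_unit_scale(x: torch.Tensor) -> torch.Tensor:
    # one_scale = 1.0 behaviour: values outside the FP8 range saturate and
    # small values lose mantissa precision.
    return x.to(torch.float8_e4m3fn)

def cast_max_abs(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-tensor max-abs scaling: map the observed range onto FP8's range
    # and return the scale so the kernel can fold it back into the output.
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale
```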
Stability sanity checks

- No `EngineDeadError`.
- No `Memory access fault by GPU` entries in the server logs.

Cross-model verification: amd/Kimi-K2.5-MXFP4

The same FP8 ASM prefill changes were surgically applied on top of `amdsiloai/vllm:20262304-kimi-k25-mxfp4-optimized-mi355x` (a custom-branch image with kimi-specific MLA decode optimizations) on MI355X TP=2 to verify the PR does not regress other MLA models and that the FP8 ASM kernel actually fires.

Server flags (from the AWS ITT config):
Env:
VLLM_ROCM_USE_AITER=1,VLLM_ROCM_AITER_FUSED_MLA_DECODE=1,VLLM_ROCM_USE_AITER_FUSED_AR_RMSNORM=1,VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1,VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4.FP8 ASM prefill launches confirmed: with logging gated on
q.shape[0] >= 256, the patched server logged 2196 launches ofmla_prefill_ps_asm_fwdwithtotal_q=10000during the AWS ITT bench (32 prompts × 10000-token prefills × 60 layers × 2 ranks; the scheduler emits each fresh 10k prefill as its own single-segment batch, soforward_mha'snum_prefill_segments > 1guard does not engage on this workload).ITT bench (
vllm bench serve --dataset-name random --random-input-len 10000 --random-output-len 1000 --num-prompts 32 --max-concurrency 4 --num-warmups 4 --seed 2 --ignore-eos), warm-cache runs:No regression and a modest gain (mainly TTFT, consistent with the FP8 ASM prefill kernel being on the critical path of the 10k-token prefill). Both runs use the kimi-image's fused decode (
(`fuses_rope_in_decode`), which this PR leaves untouched.

Test plan
--max-num-batched-tokens 16384and captures both mixed prefill-decode and FULL decode CUDA graphs without faulting.vllm bench serve1000/100/4 reference shape: 32/32 completed on all four configs; +6.6 % output throughput / -9.9 % mean TTFT vs baseline on FP8 KV.lm_eval gsm8k5-shot 250 samples: 250/250 completed on all four configs; strict-match ≥ 90 % in every case.flash_attn_varlen_funcengaged as designed; throughput and TPOT match baseline image within ~0.6 %.rocm_aiter_mla.pyis modified relative to upstream main.