---
language:
- en
license: gemma
tags:
- safetensors
- gemma4
- moe
- pruning
- reap
- cerebras
- expert-pruning
base_model:
- google/gemma-4-26b-a4b-it
library_name: transformers
pipeline_tag: text-generation
---
# Gemma 4 21B-A4B-it REAP
20% expert-pruned version of google/gemma-4-26b-a4b-it using Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model (0.20) | 0.30 variant |
|---|---:|---:|---:|
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |
REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields an ~18% reduction in total disk/memory footprint.
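Since the router and shared weights are untouched, total size scales linearly in the number of retained experts. A rough sanity check on the table above, with the per-expert parameter count back-solved from the card's own numbers (not an official figure):

```python
# Back-of-envelope parameter estimate. The per-expert size is back-solved
# from the card's table (26B -> 21.34B when dropping 25 experts in each of
# 30 layers); it is an illustrative assumption, not an official figure.
NUM_LAYERS = 30
ORIG_EXPERTS = 128

params_per_expert = (26.0e9 - 21.34e9) / (25 * NUM_LAYERS)
shared_params = 26.0e9 - ORIG_EXPERTS * NUM_LAYERS * params_per_expert

def total_params(experts_per_layer: int) -> float:
    """Estimated total parameters for a given expert budget per layer."""
    return shared_params + experts_per_layer * NUM_LAYERS * params_per_expert

print(f"{total_params(90) / 1e9:.2f}B")  # close to the 0.30 variant's 19.02B
```

The estimate for the 0.30 variant lands within ~0.1B of the table's 19.02B, consistent with expert FFNs dominating the parameter count.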
## How This Model Was Made
### Step 1: Calibration (Activation Observation)
We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns. The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.
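A minimal sketch of such an observer hook, assuming the router exposes per-token gate logits of shape `[tokens, num_experts]`. The module path and shapes here are illustrative assumptions, not Gemma 4's actual internals:

```python
import torch

class RouterObserver:
    """Accumulates per-expert routing statistics during calibration."""

    def __init__(self, num_experts: int, top_k: int = 8):
        self.top_k = top_k
        self.gate_sum = torch.zeros(num_experts)   # summed router gate weights
        self.hit_count = torch.zeros(num_experts)  # top-k selection counts

    def __call__(self, module, inputs, output):
        weights = torch.softmax(output, dim=-1)    # [tokens, num_experts]
        top = weights.topk(self.top_k, dim=-1)     # 8 active experts per token
        idx = top.indices.flatten()
        self.gate_sum.scatter_add_(0, idx, top.values.flatten())
        self.hit_count.scatter_add_(0, idx, torch.ones_like(idx, dtype=torch.float))

# Usage (hypothetical module path):
# obs = RouterObserver(num_experts=128)
# handle = model.model.layers[i].mlp.router.register_forward_hook(obs)
```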
Calibration dataset: 22,000 samples drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:
| Category | Samples | Source Dataset |
|----------|--------:|----------------|
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA[pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k[main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| Total | 22,000 | |
### Step 2: REAP Pruning
Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
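The scoring and selection step can be sketched as follows. The exact saliency formula is defined in the REAP paper; this only illustrates the frequency-weighted combination of gate values and activation norms described above:

```python
import numpy as np

def reap_saliency(gate_sum, norm_sum, hit_count, total_tokens):
    """Per-expert importance from accumulated calibration statistics.

    Illustrative combination, not the paper's exact formula.
    """
    freq = hit_count / total_tokens          # routing frequency
    safe = np.maximum(hit_count, 1)
    avg_gate = gate_sum / safe               # mean router gate value
    avg_norm = norm_sum / safe               # mean expert activation norm
    return freq * avg_gate * avg_norm

def keep_experts(scores, compression_ratio=0.20):
    """Indices of experts that survive pruning (lowest scorers dropped)."""
    n_drop = int(len(scores) * compression_ratio)  # 25 of 128 at ratio 0.20
    return np.sort(np.argsort(scores)[n_drop:])
```

After selection, the pruned experts' rows are removed from each router and the remaining gate weights are renormalized, as noted above.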
### Pruning Configuration

| Parameter | Value |
|-----------|-------|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
## Benchmark Results
### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated with lm-eval generative tasks using `--apply_chat_template` and `think_end_token=<channel|>` to properly handle Gemma 4's thinking mode. Scores were extracted from model responses via regex matching.
| Task | Original | REAP 0.20 | REAP 0.30 |
|------|-------:|-------:|-------:|
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | 86% | 84% | -- |
\* Tasks with significant answer-extraction failures (the model outputs equations rather than single letters). True accuracy is likely higher for all three models.
Notes:
- Gemma 4 is a thinking model -- it reasons internally before answering. Standard loglikelihood-based benchmarks give incorrect results because the model emits its reasoning before the answer token.
- GSM8K uses `flexible-extract`, which handles thinking output well.
- College CS and math tasks show REAP sometimes outperforming the original, likely due to sampling variance at n=50.
### Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)
Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress with proper chat template formatting.
| Domain | N | Orig AvgWords | REAP AvgWords | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
|--------|--:|-------------:|--------------:|----------:|----------:|--------------:|--------------:|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |
12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms). The REAP 0.20 model is essentially indistinguishable from the original on generation quality.
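The loop and collapse columns come from automatic heuristics; a minimal sketch of such detectors follows (window sizes and thresholds are illustrative assumptions, not the actual evaluation code):

```python
def has_loop(text: str, max_window: int = 20, min_repeats: int = 3) -> bool:
    """True if some span of 3..max_window words repeats consecutively."""
    words = text.split()
    for size in range(3, max_window + 1):
        for start in range(len(words) - size * min_repeats + 1):
            chunk = words[start:start + size]
            if all(words[start + k * size:start + (k + 1) * size] == chunk
                   for k in range(1, min_repeats)):
                return True
    return False

def has_collapse(text: str, threshold: float = 0.5) -> bool:
    """True if output is near-empty or dominated by a single token."""
    words = text.split()
    if len(words) < 5:
        return True
    top = max(words.count(w) for w in set(words))
    return top / len(words) > threshold
```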
## Architecture
Gemma 4 uses a hybrid sliding/full attention MoE architecture:
- 30 transformer layers
- Sliding attention (window=1024) for 25 layers, full attention every 6th layer
- MoE FFN with 103 remaining experts per layer (originally 128), 8 active per token
- Thinking model -- uses `<|channel>thought` / `<|channel>response` channels
- Multimodal -- supports text and vision inputs
- Context window: 262,144 tokens
- Vocab size: 262,144
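The 25/5 sliding-to-full split follows directly from the interleave pattern. Placing the full-attention layers at every 6th position (1-indexed) is an assumed convention, since the card does not state the offset:

```python
# Layer pattern implied by "full attention every 6th layer" over 30 layers.
# The offset (full attention at layers 6, 12, 18, 24, 30) is an assumption.
NUM_LAYERS = 30
pattern = ["full" if (i + 1) % 6 == 0 else "sliding" for i in range(NUM_LAYERS)]
print(pattern.count("sliding"), pattern.count("full"))  # 25 5
```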
## Usage
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### vLLM

```shell
pip install "vllm>=0.19" "transformers>=5.0"
```

```shell
vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384 \
  --trust-remote-code
```
## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
## Links
- REAP paper: arxiv.org/abs/2510.13999
- REAP code: github.com/cerebras/reap
- 30% pruned variant: 0xSero/gemma-4-19b-a4b-it-REAP
- Base model: google/gemma-4-26b-a4b-it
Author: 0xSero