Today's AI Summary

AI Developments: New Models Emerge and Research Advances

Here's a summary of the latest AI model releases and research papers:

Research Papers

Several interesting research papers have emerged, focusing on diverse areas of AI:

  • SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions: This paper introduces a new benchmark, SUB, for evaluating the robustness of Concept Bottleneck Models (CBMs) under distribution shifts. It uses synthetic images with concept substitutions to rigorously test CBMs, contributing to the development of more reliable interpretable models.
  • Phi-Ground Tech Report: Advancing Perception in GUI Grounding: This report details the development of the Phi-Ground model family, achieving state-of-the-art performance in GUI grounding tasks. The research provides insights into data collection and model training for improving perception in computer use agents. The model achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision.
  • SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model: This paper introduces SimuRA, a goal-oriented architecture for generalized agentic reasoning. It uses an LLM-based world model for planning via simulation, overcoming limitations of autoregressive LLMs. Experiments on web browsing tasks show significant improvements in success rates compared to autoregressive planning.
  • CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks: This paper proposes CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics.
  • Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving: This paper introduces Seed-Prover, a lemma-style whole-proof reasoning model for automated theorem proving. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin.

Model Releases

Several new models have been released, targeting various applications:

  • ubergarm/GLM-4.5-Air-GGUF: This model is a GGUF quantization of zai-org/GLM-4.5-Air, designed for conversational text generation. It requires the ik_llama.cpp fork for support and offers strong perplexity relative to its memory footprint.
  • yasserrmd/DentaInstruct-1.2B: This model is a fine-tuned instruction-following language model for dental domain queries. It is based on LiquidAI/LFM2-1.2B and fine-tuned with the Unsloth library. It demonstrates excellent terminology handling and instruction following in the dental domain.
  • Alpha-VLLM/Lumina-mGPT-2.0-Omni: This model is a stand-alone, decoder-only autoregressive model trained from scratch that unifies a broad spectrum of image generation tasks, including text-to-image generation, image pair generation, subject-driven generation, multi-turn image editing, controllable generation, and dense prediction.

Key Takeaways

  • Robustness and Interpretability: Research is focusing on improving the robustness and interpretability of AI models, particularly in critical applications like medicine.
  • GUI Grounding Advances: Significant progress is being made in GUI grounding, enabling more effective computer use agents.
  • Reasoning and Planning: New architectures are emerging to enhance reasoning and planning capabilities in AI agents, leveraging LLMs and world models.
  • Domain-Specific Models: Fine-tuned models for specific domains, such as dentistry, are demonstrating strong performance in their respective areas.
  • Image Generation: New models are unifying various image generation tasks, showcasing the potential for versatile image creation.

AI Papers for 2026-04-22

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

Sessa: Selective State Space Attention

Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which is asymptotically slower than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer- and Mamba-style baselines on short-context language modeling.

Bounded Ratio Reinforcement Learning

Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok~4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5\%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
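The hierarchical multi-trial aggregation in (2) can be sketched as follows. This is an illustrative reading of "logit-space shrinkage," not the paper's exact estimator: the shrinkage weight, the base-rate prior, and the plain logit averaging are all assumptions.

```python
import math

def aggregate_trials(probs, base_rate=0.5, shrink=0.3):
    """Combine K independent trial probabilities in logit space,
    shrinking the pooled logit toward a prior. `base_rate` and
    `shrink` are hypothetical values, not taken from the paper."""
    prior = math.log(base_rate / (1 - base_rate))
    logits = [math.log(p / (1 - p)) for p in probs]
    mean = sum(logits) / len(logits)                 # pool the K trials
    combined = (1 - shrink) * mean + shrink * prior  # shrink toward prior
    return 1 / (1 + math.exp(-combined))             # back to probability
```

Shrinking in logit space tempers extreme averages when trials disagree or when a source's base rate is skewed, which is the calibration failure mode the paper's hierarchical prior targets.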

When Can LLMs Learn to Reason with Weak Supervision?

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer $l_{\mathrm{crit}}$, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\mathbf{44.0\%}$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $\chi^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\chi^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel \textbf{detection-correction dissociation}: error-detection AUC peaks at layer~14 ($0.718$) but task accuracy peaks at layer~16 ($44.0\%$ vs.\ $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.
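A minimal sketch of the dual gate. The thresholds, the reading of "directional reversal" as negative cosine similarity, and the use of plain lists for activations are all illustrative assumptions; the paper's gate operates on residual-stream activations at the critical layer.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def phase_shift(prev_state, cur_state, next_token_probs,
                cos_thresh=0.0, ent_thresh=1.0):
    """Dual gate (thresholds are illustrative): fire only when the
    residual-stream direction reverses (cosine below threshold) AND
    the next-token distribution is uncertain (entropy above threshold)."""
    return (cosine(prev_state, cur_state) < cos_thresh
            and entropy(next_token_probs) > ent_thresh)
```

Requiring both signals keeps false-positive rollbacks down: a directional reversal while the model is still confident in its next token may be a benign topic shift rather than an error.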

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best local model reaches 77\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\% on model building steps and 47--75\% on feedback explanation, but only 0--50\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of \textit{model type effects} on performance: we compare reasoning vs.\ instruction-tuned architectures, GGUF (llama.cpp) vs.\ MLX (mlx\_lm) backends, and quantization levels (Q3 / Q4\_K\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

AI Models

huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated


license: apache-2.0
language: en
library_name: transformers
pipeline_tag: text-generation
base_model: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
tags: text-generation, reasoning, distillation, chain-of-thought, qwen, qwen3.6, mixture-of-experts, moe, lora, unsloth, abliterated, uncensored

This is an uncensored version of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled created with abliteration (see remove-refusals-with-transformers to learn more). This is a crude, proof-of-concept implementation of removing refusals from an LLM without using TransformerLens.

ollama

Please use the latest version of Ollama.

You can run huihui_ai/qwen3.6-abliterated:35b-Claude-4.7 directly:

ollama run huihui_ai/qwen3.6-abliterated:35b-Claude-4.7

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue development and improvement; even the price of a cup of coffee makes a difference.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai


florianleibert/kimi-k26-dflash-mi300x


license: apache-2.0
language: en
base_model: moonshotai/Kimi-K2.6, z-lab/Kimi-K2.5-DFlash
tags: dflash, speculative-decoding, amd, mi300x, rocm, vllm, inference, optimization, kimi, moe
Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X

<p align="center"> <strong>5.6x throughput improvement</strong> over baseline autoregressive serving<br> <em>90 tok/s → 508 tok/s on the same hardware, same model, zero quality loss</em> </p>

Performance

Throughput Scaling

<p align="center"> <img src="assets/throughput-scaling.png" alt="Throughput scaling chart showing 90 to 508 tok/s" width="900"> </p>

Head-to-Head: DFlash vs Autoregressive

| Users | Autoregressive (baseline) | DFlash st=2 (this config) | Speedup |
|---:|---:|---:|---:|
| 8 | 90.4 tok/s | 127.1 tok/s | 1.4x |
| 12 | 125.1 tok/s | 192.8 tok/s | 1.5x |
| 16 | — | 250.8 tok/s | — |
| 24 | — | 379.0 tok/s | — |
| 32 | — | 507.6 tok/s | 5.6x |

All measurements: no prefix cache, warmed server, 512 max tokens, temperature=0, prompts from a diverse reasoning benchmark set. Latency is flat at ~30s regardless of concurrency.

Per-User Latency

<p align="center"> <img src="assets/latency-flat.png" alt="Latency stays flat as concurrency scales" width="750"> </p>

| Concurrent users | Mean latency | P95 latency | Per-user tok/s |
|---:|---:|---:|---:|
| 8 | 31.0s | 31.3s | 15.9 |
| 16 | 30.8s | 31.1s | 15.7 |
| 24 | 30.0s | 30.4s | 15.8 |
| 32 | 30.7s | 31.0s | 15.9 |

Latency does not degrade as concurrency increases. Each user gets a consistent ~15.8 tok/s regardless of how many others are being served.
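As a sanity check, the aggregate numbers in the scaling table are just the per-user rate times the concurrency:

```python
# per-user tok/s from the latency table above
per_user = {8: 15.9, 16: 15.7, 24: 15.8, 32: 15.9}

# aggregate throughput implied by the flat per-user rates
implied = {users: users * rate for users, rate in per_user.items()}
# e.g. 32 users x 15.9 tok/s/user = 508.8 tok/s, matching the
# reported 507.6 tok/s peak to within rounding
```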


What is this?

A production-ready serving configuration for moonshotai/Kimi-K2.6 using DFlash speculative decoding with the z-lab/Kimi-K2.5-DFlash draft model, optimized for AMD MI300X GPUs.

This is not a new model — it's an optimized serving recipe. The model weights are unchanged. Output quality is identical to standard autoregressive serving.

Three optimizations that delivered 5.6x

<p align="center"> <img src="assets/optimization-journey.png" alt="Optimization journey from 90 to 508 tok/s" width="750"> </p>

| What | Before | After | Impact |
|---|---|---|---|
| NUMA balancing | Enabled | Disabled | Removed memory-access bottleneck across NUMA domains |
| DFlash spec tokens | 8 | 2 | Acceptance rate 16% → 50%; DFlash went from net-negative to net-positive |
| max_num_seqs | 8 | 32 | Linear throughput scaling: each slot adds ~15.8 tok/s |


Hardware

<p align="center"> <img src="assets/hardware-stack.png" alt="Hardware and software stack" width="800"> </p>

| Component | Specification |
|---|---|
| GPU | 8x AMD Instinct MI300X |
| GPU Architecture | CDNA 3 (gfx942) |
| VRAM per GPU | 192 GB HBM3 |
| Total VRAM | 1,536 GB (1.5 TB) |
| System RAM | ~2 TB |
| Storage | NVMe (14 TB), model on local disk |
| Runtime | vLLM v0.19.2 ROCm nightly |
| ROCm Version | 6.x |

Model Specifications

| | Target Model | Draft Model |
|---|---|---|
| Name | moonshotai/Kimi-K2.6 | z-lab/Kimi-K2.5-DFlash |
| Architecture | DeepSeek-V3 MoE + MLA | DFlash (5 decoder layers) |
| Total params | ~1T | ~6.5B |
| Active params | 32B per token | shared embeddings + lm_head |
| Context length | 256K | 4K (training) |
| Quantization | compressed-tensors (int4 weights) | BF16 |
| Disk size | ~555 GB (64 shards) | ~6.5 GB |


Quick Start

1. Download models

# Target model (~555 GB)
huggingface-cli download moonshotai/Kimi-K2.6 --local-dir /models/Kimi-K2.6

# Draft model (~6.5 GB)
huggingface-cli download z-lab/Kimi-K2.5-DFlash --local-dir /models/Kimi-K2.5-DFlash

2. Configure

Edit configs/production.env:

MODEL_DIR=/models/Kimi-K2.6
DRAFT_MODEL_DIR=/models/Kimi-K2.5-DFlash

3. Disable NUMA balancing (required)

sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

4. Launch

./serve.sh

Server takes ~5 minutes to load. Once ready:

curl http://localhost:8262/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.6-amd-dflash",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis"}],
    "max_tokens": 512,
    "temperature": 0
  }'
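The same request can be sent from Python using only the standard library; the endpoint, port, and model name mirror the curl example above.

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=512):
    # mirrors the curl example's request body
    return {
        "model": "kimi-k2.6-amd-dflash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

def chat(prompt, base_url="http://localhost:8262/v1"):
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```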

5. Benchmark

# Single-shot throughput benchmark
python3 payload/benchmark_multi_turn.py \
  --base-url http://localhost:8262/v1 \
  --model kimi-k2.6-amd-dflash \
  --sessions 32 --turns-per-session 1 \
  --max-tokens 512

# Compare against autoregressive baseline:
# Launch without DFlash (remove --speculative-config, set --block-size 1)
# and run the same benchmark

How DFlash Works

Standard Autoregressive          DFlash Speculative (st=2)
=======================          =========================

Step 1: Generate token 1         Step 1: Draft predicts tokens 1,2
Step 2: Generate token 2         Step 2: Target verifies both in ONE pass
Step 3: Generate token 3           → If both accepted: got 2 tokens for ~1 step
Step 4: Generate token 4           → If only token 1 accepted: got 1 token
...                              Step 3: Draft predicts tokens 3,4
                                 Step 4: Target verifies...

4 tokens = 4 forward passes      4 tokens ≈ 2-3 forward passes

The draft model (Kimi-K2.5-DFlash, 6.5 GB) is ~85x smaller than the target. It runs in <1% of the target's compute time. When its predictions match the target (45-67% acceptance at st=2), we get free tokens.

Why st=2 instead of st=8?

<p align="center"> <img src="assets/acceptance-comparison.png" alt="Acceptance rate comparison: st=8 vs st=2" width="900"> </p>

The public drafter was trained for K2.5, not K2.6. The model mismatch causes acceptance to drop sharply at later positions:

| Spec tokens | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4-7 | Avg acceptance | Net effect |
|---:|---:|---:|---:|---:|---:|---:|---|
| 2 | 64% | 34% | — | — | — | 49% | +40% throughput |
| 8 | 64% | 34% | 18% | 9% | <3% | 16% | -20% throughput |

At st=8, the target model wastes compute verifying 6 tokens that will almost certainly be rejected. At st=2, every verification step has a ~50% chance of yielding a free token.
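Under the simplifying assumptions that the per-position rates above are independent marginal acceptance probabilities and that verification cost grows with the number of positions the target must check, the trade-off can be quantified:

```python
# per-position acceptance rates from the table above
accept_st2 = [0.64, 0.34]
accept_st8 = [0.64, 0.34, 0.18, 0.09] + [0.03] * 4  # "pos 4-7: <3%"

def tokens_per_pass(accept):
    # each verification pass yields the accepted draft tokens plus
    # one token from the target model itself
    return 1 + sum(accept)

def tokens_per_position(accept):
    # generated tokens per position the target verifies: a rough
    # proxy for target compute spent per output token
    return tokens_per_pass(accept) / (1 + len(accept))
```

In this toy model, st=8 yields slightly more tokens per pass (~2.4 vs ~2.0) but verifies 9 positions instead of 3, so each output token costs roughly 2.5x more target compute, consistent with st=8 being net-negative in practice.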


ROCm Patches

DFlash requires 9 patches to work on ROCm with MLA attention. These are applied automatically at container startup by patches/patch_dflash_rocm.py. Among other things, the patches:

  1. Add non-causal attention support to AITER flash attention backend
  2. Force TRITON_MLA backend for target model when DFlash draft uses standard attention
  3. Add IS_CAUSAL parameter to Triton unified attention kernels
  4. Relax causal assertions in the DFlash verification path

All patches are idempotent and track upstream vllm-project/vllm#39930.


Configuration Reference

# configs/production.env — all tunable parameters

NUM_SPECULATIVE_TOKENS=2    # DFlash draft tokens per step
MAX_NUM_SEQS=32             # Max concurrent decode sequences
MAX_NUM_BATCHED_TOKENS=32768 # Max tokens per scheduler step
MAX_MODEL_LEN=262144        # Max context length (256K)
GPU_MEMORY_UTILIZATION=0.90 # Fraction of VRAM for KV cache
BLOCK_SIZE=16               # Required for DFlash + MLA
ENFORCE_EAGER=true          # Compiled mode provides no gain
MOE_BACKEND=aiter           # AMD's optimized MoE kernels

Known Constraints

| Constraint | Root cause | Workaround |
|---|---|---|
| max_num_batched_tokens capped at 32768 | AITER MoE kernel grid overflow at 384 experts × large batch | Stay at 32768 |
| K2.5 drafter acceptance ~50% | Model version mismatch (trained for K2.5) | Train a K2.6-specific drafter (see below) |


FP8 KV Cache: 901 tok/s (updated numbers)

FP8 KV cache halves KV memory (8 bits vs 16 bits per element). Measured KV capacity: 2,469,568 tokens, up from 1,230,368 with BF16 (2.01x). This enables max_num_seqs=64, pushing aggregate throughput to 901 tok/s, 1.77x over the BF16 baseline.

Head-to-Head: BF16 vs FP8 KV

| Concurrent users | BF16 KV (seqs=32) | FP8 KV (seqs=64) | Speedup |
|---:|---:|---:|---:|
| 8 | 127.1 tok/s | — | — |
| 16 | 250.8 tok/s | — | — |
| 24 | 379.0 tok/s | — | — |
| 32 | 507.6 tok/s | 394.6 tok/s | 0.78x |
| 48 | — | 593.6 tok/s | — |
| 64 | — | 900.9 tok/s | 1.77x |

At matched concurrency (c=32), FP8 is ~22% slower per slot due to dynamic scale computation overhead. But FP8 enables 2x more concurrent sequences, and aggregate throughput at c=64 is 1.77x the BF16 peak.
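The per-slot and aggregate figures above follow directly from the table:

```python
per_slot_bf16 = 507.6 / 32  # ≈ 15.9 tok/s per slot at c=32
per_slot_fp8 = 394.6 / 32   # ≈ 12.3 tok/s per slot at c=32

# per-slot slowdown from dynamic scale computation
slowdown = 1 - per_slot_fp8 / per_slot_bf16  # ≈ 0.22, the ~22% figure

# aggregate gain at c=64 vs the BF16 peak at c=32
aggregate_gain = 900.9 / 507.6               # ≈ 1.77x
```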

The FP8 scale problem (and fix)

The Kimi-K2.6 checkpoint has no pre-computed FP8 KV scales. Without them, vLLM defaults to scale=1.0, which clips KV values in FP8 E4M3 range and produces degenerate output (vllm#13133, vllm#27364).

Our fix: a runtime patch to the MLA do_kv_cache_update that computes scales dynamically from each batch's actual KV data using a running-max approach. The scale converges after the first few requests and stays stable. Calibration with 200 diverse prompts (51K tokens) confirmed the converged scale range: 0.026–0.068.
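For illustration, the running-max idea can be sketched as follows. `RunningMaxKVScale` and its method names are ours for exposition only, not the actual patched vLLM code:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

class RunningMaxKVScale:
    """Illustrative running-max scale tracker; not the real vLLM patch."""
    def __init__(self):
        self.running_max = 0.0

    def update(self, kv: np.ndarray) -> float:
        # Grow the observed |KV| max monotonically, then map it onto the FP8 range,
        # so the scale converges once the largest activation has been seen.
        self.running_max = max(self.running_max, float(np.abs(kv).max()))
        return self.running_max / FP8_E4M3_MAX

scaler = RunningMaxKVScale()
for batch in ([3.1, -2.0], [11.8, 4.2], [9.2], [12.0, -12.0]):  # synthetic batches
    scale = scaler.update(np.array(batch))
print(round(scale, 4))  # stable once the max has been observed
```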

The 384-expert AITER crash does NOT affect FP8 KV — that's a MoE-side issue triggered only at max_num_batched_tokens > 32768. FP8 KV is purely attention-side.

Quick start: FP8 KV

./serve.sh configs/production-fp8kv.env

Configs

| Config | KV dtype | MoE backend | max_num_seqs | Throughput |
|---|---|---|---:|---|
| production.env | BF16 | AITER | 32 | 508 tok/s |
| production-fp8kv.env | FP8 | AITER | 64 | 901 tok/s |


Training a K2.6-Matched DFlash Drafter

The public drafter (z-lab/Kimi-K2.5-DFlash) was trained for K2.5 and gets ~50% acceptance on K2.6. A K2.6-matched drafter should reach 60-80% acceptance, making num_speculative_tokens=8 viable and roughly doubling per-slot throughput.
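The acceptance-to-throughput link can be sanity-checked with the standard speculative-decoding approximation (assuming i.i.d. per-token acceptance, which real drafters only roughly satisfy):

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens and
    per-token acceptance probability p: sum_{i=0..k} p^i (truncated geometric)."""
    return sum(p ** i for i in range(k + 1))

print(round(expected_tokens_per_step(0.50, 2), 2))  # 1.75 (K2.5 drafter, st=2)
print(round(expected_tokens_per_step(0.75, 8), 2))  # 3.7  (matched drafter, st=8)
```

The ~2.1x ratio between the two is consistent with the "roughly doubling per-slot throughput" estimate.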

Architecture

The drafter is a 6-layer Qwen3-based decoder (~1.2B trainable params) that:

  • Shares embeddings and LM head with the target (frozen)
  • Reads hidden states from 6 target layers: [1, 12, 24, 35, 47, 58]
  • Projects concatenated target hidden states through an FC layer
  • Uses block-causal attention (block_size=16 for training, 8 for inference)

The config is at configs/kimi-k2.6-dflash-draft.json — identical to K2.5-DFlash since the architectures match.
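The block-causal pattern from the last bullet can be visualised with a small mask (illustrative only; the actual DFlash kernel layout may differ):

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True where attention is allowed: full attention within a block,
    causal across blocks."""
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]

m = block_causal_mask(8, block_size=4)
print(m.astype(int))  # two 4x4 all-ones diagonal blocks, lower-left block filled
```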

Training pipeline

# Full pipeline: setup SpecForge, regenerate data with K2.6, train drafter
./train-drafter.sh

# Skip regeneration if data exists
./train-drafter.sh --skip-regen

# Skip setup + regen, just train
./train-drafter.sh --skip-setup

The pipeline uses SpecForge and runs three phases:

  1. Setup: Clone SpecForge, prepare PerfectBlend dataset (~1.16M samples)
  2. Regenerate: Run prompts through K2.6 to get target-distribution responses (hours)
  3. Train: 6-epoch DFlash training on 8x MI300X (3-6 days)

Serving with matched drafter

# After training completes:
./serve.sh configs/production-fp8kv-matched.env

Expected performance with matched drafter

| Metric | K2.5 drafter (current) | K2.6 drafter (matched) |
|---|---|---|
| Acceptance rate (st=2) | ~50% | ~75% |
| Acceptance rate (st=8) | ~16% | ~65% |
| Best spec tokens | 2 | 8 |
| Per-slot tok/s | 15.8 | ~25 |
| Aggregate at seqs=64 | 901 | ~1600 |


Optimization Roadmap

| Optimization | Expected throughput | Status |
|---|---|---|
| BF16 KV, K2.5 drafter, seqs=32 | 508 tok/s | Done |
| FP8 KV, K2.5 drafter, seqs=64 | 901 tok/s | Done (updated numbers) |
| K2.6 matched DFlash drafter | ~800 tok/s at seqs=32 | Training pipeline ready |
| FP8 KV + matched drafter, seqs=64 | ~1600 tok/s | Needs matched drafter |
| DDTree draft trees | +35% on matched drafter | Research (arXiv 2604.12989) |


Repository Structure

kimi-k26dflash/
├── README.md                       # This file
├── serve.sh                        # Server launch (pass config as arg)
├── validate-fp8.sh                 # FP8 KV validation + benchmark
├── train-drafter.sh                # K2.6 DFlash drafter training pipeline
├── Dockerfile.kimi26-dflash        # Patch-at-build Docker image
├── build-kimi26-dflash.sh          # Docker build helper
├── configs/
│   ├── production.env              # BF16 KV, 508 tok/s (current)
│   ├── production-fp8kv.env        # FP8 KV, seqs=64, 901 tok/s
│   ├── production-fp8kv-safe.env   # FP8 KV + Triton MoE fallback
│   ├── production-fp8kv-matched.env # FP8 KV + matched drafter, ~1600 tok/s
│   └── kimi-k2.6-dflash-draft.json # DFlash drafter architecture config
├── patches/
│   └── patch_dflash_rocm.py        # 9 ROCm patches (idempotent)
├── launchers/
│   ├── kimi26-vllm-dflash.sh       # Standard launcher
│   └── kimi26-vllm-dflash-sweep.sh # Parameter sweep
├── payload/
│   ├── benchmark_multi_turn.py     # Multi-turn benchmark tool
│   ├── calibrate_kv_scales.py      # FP8 KV scale calibration
│   └── preshard_kimi26.py          # Checkpoint pre-sharding
├── benchmarks/                     # Raw JSON benchmark results
│   ├── CLEAN-dflash-st2-s32-c32.json   # 508 tok/s
│   ├── CLEAN-dflash-st2-s24-c24.json   # 379 tok/s
│   └── ...
└── docs/
    ├── kimi-k2.6-250-toks-achieved-2026-04-21.md
    ├── kimi-k2.6-acceptance-rate-analysis-2026-04-21.md
    └── kimi-k2.6-dflash-execution-playbook-2026-04-21.md

Citation

If you use this configuration:

@misc{kimi-k26-dflash-mi300x-2026,
  title={Kimi K2.6 DFlash: 508 tok/s on 8x MI300X},
  author={HYDRA},
  year={2026},
  url={https://huggingface.co/hydra/kimi-k26-dflash-mi300x}
}

Acknowledgments

Author: florianleibert

Likes: 6

Downloads: 0

Tags: dflash, speculative-decoding, amd, mi300x, rocm, vllm, inference, optimization, kimi, moe, en, base_model:moonshotai/Kimi-K2.6, base_model:finetune:moonshotai/Kimi-K2.6, license:apache-2.0, region:us

lightseekorg/kimi-k2.6-eagle3


library_name: safetensors
pipeline_tag: text-generation
tags:

  • kimi-k2.6
  • eagle3
  • torchspec

kimi-k2.6-eagle3

Sharded safetensors export.

Author: lightseekorg

Likes: 6

Downloads: 0

Tags: safetensors, llama, kimi-k2.6, eagle3, torchspec, text-generation, region:us

tejadabheja/guru


license: apache-2.0
tags:

  • graph-reasoning
  • non-neural
  • cpu-native
  • explainable-ai
  • vector-symbolic

language:

  • en
  • multilingual

library_name: webmind
pipeline_tag: text-generation

webmind-brain-v1

A graph-based reasoning engine. Not a neural network. No gradient descent. No GPU required.

The brain learns by building a co-occurrence graph over word vectors, then reasons by converging through the graph. Every answer has a traceable source. Knowledge is editable and deletable.

Quick Start

pip install numpy fastapi uvicorn lmdb

from webmind import Brain

brain = Brain.from_pretrained("webmind/webmind-brain-v1")

# Teach it something
brain.teach("Paris is the capital of France")
brain.teach("London is the capital of England")

# Ask
result = brain.ask("capital of France")
print(result["answer"])      # paris capital france
print(result["confidence"])  # 0.85
print(result["strategy"])    # convergence / co-occurrence / abstain

# Generate fluent text
gen = brain.generate("Tell me about France", max_tokens=20, temperature=0.7)
print(gen["text"])

# Save
brain.flush()

OpenAI-Compatible Server

python serve.py
# Then:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "capital of france"}]}'

Supports streaming ("stream": true), the /v1/models endpoint, and /health.

Architecture

Input -> Garbage Filter (heuristic + LSH)
      -> Tier 1: Q→A Direct Lookup (LRU + LMDB, <1ms)
      -> Tier 1.5: LSH Semantic Search (O(1) bucket lookup, seed concepts)
      -> Tier 2: Convergence Loop (multi-hop reasoning over sparse graph)
             -> Co-occurrence Search (complementary sparse signal)
             -> Sentence Retrieval (full text from LMDB)
      -> Confidence Floor (abstain if < 0.15)
      -> Web Search fallback (DuckDuckGo + Wikipedia)

Key properties:

  • Co-occurrence graph: words that appear together pull toward each other in a sparse matrix
  • Convergence loop: iteratively search the graph, blending discovered concepts back into the query until the output stabilizes
  • Dual retrieval: dense neuron search + sparse co-occurrence search race in parallel
  • Successor chains: each word neuron stores its top-10 successors for generation
  • Confidence tracking: every neuron has a confidence score that grows when useful and shrinks when not
  • LSH vocabulary filter: locality-sensitive hashing over MiniLM embeddings for garbage detection, morphological linking ("gravitational"→"gravity"), vocabulary dedup, and O(1) semantic search
  • ScaNN backend: Google's anisotropic vector quantization for faster ANN search (optional, falls back to LSH)
  • Int8 quantization: PolarQuant-inspired 4x embedding compression with ~1% accuracy loss
  • Confidence floor: abstain rather than return weak convergence results (bad context > no context)
  • Vocabulary pruning: score words by convergence contribution, remove low-value entries
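As a minimal sketch of the confidence-floor rule from the pipeline above (function and field names are illustrative, not webmind's actual API):

```python
CONFIDENCE_FLOOR = 0.15  # threshold from the architecture diagram above

def answer_or_abstain(answer, confidence):
    # Return weak convergence results as abstentions rather than bad context.
    # Illustrative shape; the real webmind result dict may differ.
    if confidence < CONFIDENCE_FLOOR:
        return {"strategy": "abstain", "answer": None, "confidence": confidence}
    return {"strategy": "convergence", "answer": answer, "confidence": confidence}

print(answer_or_abstain("paris capital france", 0.85)["strategy"])  # convergence
print(answer_or_abstain("unknown topic", 0.07)["strategy"])         # abstain
```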

What It Is Good At

  • Factual Q&A with traceable sources
  • Multi-hop reasoning (convergence crosses concept boundaries)
  • Incremental learning (teach new facts at runtime, no retraining)
  • Honest failure (says "I don't know" when it doesn't converge)
  • Knowledge editing (delete a neuron = delete a fact)

What It Is Not Good At

  • Fluent prose generation (output is concept-oriented, not grammatically polished)
  • Creative writing
  • Long-form text
  • Tasks requiring deep syntactic understanding

Training Data

This model ships empty. It learns from what you teach it. The from_pretrained download includes the graph structure and vocabulary but no pre-loaded knowledge.

For evaluation, we tested on HotPotQA (200 train, 50 test) achieving 72% exact match with word neurons + successor chains.

Limitations

  • Context window is limited by the convergence loop (not fixed-length, but practically ~10 hops)
  • Generation quality depends heavily on what has been taught
  • No coreference resolution beyond what convergence provides
  • Function words are stripped during reasoning (grammar handled separately)

Citation

If you use this work, please cite:

@software{webmind_brain_2026,
  title={Webmind Brain: Graph-Based Reasoning Without Neural Networks},
  url={https://github.com/webmind-ai/webmind-brain},
  year={2026},
  license={Apache-2.0}
}

License

Apache 2.0

Author: tejadabheja

Likes: 5

Downloads: 0

Tags: webmind, webmind-brain, graph-reasoning, non-neural, cpu-native, explainable-ai, vector-symbolic, text-generation, en, multilingual, license:apache-2.0, region:us

wangzhang/gpt-oss-120b-abliterated


license: apache-2.0
base_model: openai/gpt-oss-120b
tags:

  • abliterated
  • uncensored
  • moe
  • gpt-oss
  • mxfp4
  • direct-steering
  • ega
  • moe-router-suppression
  • vllm-in-place-editing
  • abliterix

language:

  • en
  • zh

library_name: transformers
pipeline_tag: text-generation

gpt-oss-120b-abliterated

A refusal-suppressed variant of openai/gpt-oss-120b, produced with abliterix. The release combines direct weight editing, Expert-Granular Abliteration (EGA) on the fused MoE expert weights (128 experts × 36 layers), and MoE router suppression on the safety-concentrated experts, using a new vLLM in-place TP editing path that replaces the HF pipeline-parallel + LoRA-adapter pipeline used for gpt-oss-20b.

Key results

| Metric | Base gpt-oss-120b | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | 100 / 100 | 26 / 100 |
| KL divergence vs base (prompt next-token, benign) | — | 5.44e-06 |
| Response-length deviation vs base (benign) | — | 0.042 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | 12 / 15 |

The eval refusal counts come from an LLM judge (google/gemini-3.1-flash-lite-preview via OpenRouter) running at batch_size = 1 with 16 concurrent workers. The judge prompt is hardened to treat metaphorical deflection ("pipe-bomb → pipe-separated typeahead list", "lock-pick → presentation topic") and generic template filler as refusals, not compliance, so a model that "bypasses" refusal by collapsing into off-topic boilerplate is correctly counted as a failure. max_gen_tokens was raised from 150 to 256 during search so long-form repetition loops surface before the judge is called. The 26/100 is a real, semantic compliance number, validated via a separate 15-prompt HF side-by-side test.

The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, password-stealer malware, shoplifting, and the Chinese equivalents 炸弹制作 (bomb-making), 钓鱼邮件 (phishing emails), 电脑入侵 (computer intrusion), 身份证伪造 (ID forgery), and 网络诈骗 (online fraud). The model complies directly with 12/15 in the same Markdown-table, step-by-step style the base model uses for benign technical answers; 1/15 gets a metaphorical food-item deflection (CN 炸弹 "bomb" → decorative toy), and 2/15 drift off-topic on WiFi/fake-news.

Why this needed new machinery — four gpt-oss-120b-specific correctness fixes

abliterix handles four issues that silently break naïve abliteration pipelines on gpt-oss-120b:

  1. Native MXFP4 weights are not exposed as standard nn.Parameter. gpt-oss ships in Mxfp4GptOssExperts form whose down_proj is a packed Triton tensor that cannot be edited in-place. For the 120b variant abliterix now pre-dequantises the whole 65 GB MXFP4 checkpoint to a 232 GB BF16 safetensors checkpoint on disk (scripts/prepare_bf16_checkpoint.py), because vLLM's Mxfp4MoEMethod.process_weights_after_loading would otherwise repack w2_weight into an opaque block layout that silently swallows in-place writes (see vLLM RFC #31848).
  2. GptOssExperts.down_proj is stored transposed vs the standard MoE convention: shape (experts, intermediate_in, hidden_out) with forward path out = act @ W (no transpose). Standard EGA implementations use shape-based axis detection, which silently picks the wrong projection branch when hidden == intermediate (both 2880 in gpt-oss-120b). abliterix marks this layout explicitly and projects from the output side (W_new = W (I − vv^T)).
  3. Fused-expert MoEs were silently invisible to EGA. GptOssExperts is a single Module holding fused 3-D weights, so a naive per-Module profile dict key produces no mlp.down_proj entry and _apply_ega_steering early-exits. abliterix synthesises an mlp.down_proj profile when fused experts are detected so EGA actually runs across all 128 experts × 36 layers.
  4. HF pipeline-parallel on 120b was too slow to iterate on. A single trial on HF PP across 4× RTX PRO 6000 was >2 min; 100 trials would have been >3 h of pure generation. abliterix v1.5 adds a vLLM TP=4 in-place editor (VLLMExpertEditor, VLLMAttentionEditor) that edits w2_weight, qkv_proj.weight, and o_proj.weight directly on TP workers via collective_rpc + reset_prefix_cache. This requires VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton (FLASHINFER_TRTLLM repacks w2_weight into a non-editable block layout), VLLM_ALLOW_INSECURE_SERIALIZATION=1 (ships worker fns as pickle), and enforce_eager=true (CUDA graphs cache weight pointers so edits would otherwise be read only on the first forward). Per-trial time dropped to ~60 s end-to-end.
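The output-side projection from fix 2 is easy to verify numerically. A minimal sketch on synthetic weights (the real edit operates on the fused `(experts, intermediate_in, hidden_out)` tensor, one expert slice at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
v = rng.normal(size=hidden)
v /= np.linalg.norm(v)                  # unit refusal direction in output space

W = rng.normal(size=(hidden, hidden))   # stand-in down_proj slice; forward: out = act @ W

P = np.eye(hidden) - np.outer(v, v)     # I - v v^T
W_new = W @ P                           # output-side edit: W_new = W (I - v v^T)

act = rng.normal(size=hidden)
out = act @ W_new
print(abs(float(out @ v)) < 1e-10)      # True: outputs carry no component along v
```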

On top of direct steering + EGA, this release carries MoE router suppression — an [experts] block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). For 120b with 128 experts/layer, the optimiser picked n_suppress = 1 with router_bias = -4.11 (suppression scale ≈ 0.59 — moderately aggressive), leaving 127/128 experts untouched while damping the single most refusal-aligned expert per layer.
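The bias-to-scale mapping quoted above works out as follows:

```python
def suppression_scale(router_bias: float) -> float:
    # Mapping stated in this card: scale = max(0, 1 + bias / 10)
    return max(0.0, 1.0 + router_bias / 10.0)

print(round(suppression_scale(-4.11), 3))  # 0.589, the value reported above
```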

Method

  • Base: openai/gpt-oss-120b — 36 layers, 128 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 pre-dequant before abliteration (232 GB safetensors on disk)
  • Tool: abliterix
  • Mode: steering_mode = "direct" (orthogonal projection on base weights, no LoRA), weight_normalization = "full" (norm-preserving projection)
  • Components steered:
    • attn.o_proj via direct weight projection (Q/K/V disabled — refusal signal on gpt-oss concentrates in the output side)
    • mlp.experts.down_proj across all 128 experts × 36 layers via Expert-Granular Abliteration
    • mlp.router rows of the top-1 safety expert per layer via logit suppression
  • Refusal direction: global scope (single direction), mean of (target − benign) residuals at a learned layer index (16.31, ≈ 45% depth) on a 400-prompt benign + 400-prompt harmful set; BF16 projection
  • Search: Optuna TPE, (KL + 0.5·length_deviation, refusals/baseline) multi-objective, 100 trials (40 random warmup + 60 TPE exploitation)
  • Judge hardening (vs older abliterated MoE releases):
    • llm_judge_batch_size = 1 (each Q/A pair gets a dedicated API call — no anchor bias from batched labelling)
    • llm_judge_concurrency = 16 parallel workers
    • response_truncation = 2000 chars (≈ 500 tokens — covers full max_gen_tokens = 256 output, so long-form collapse is visible to the judge)
    • Prompt contains metaphor-deflection few-shot examples so "pipe bomb → typeahead list" is labelled R, not C
  • Hardware: 4 × NVIDIA RTX PRO 6000 Blackwell 96 GB (sm_120, PCIe-only, no NVLink), driver 580 / CUDA 12.9, TP=4, batch=32, total wall time ≈ 100 min for 100 trials
  • Eval set: 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison
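The refusal-direction step can be sketched on synthetic residuals (the unit-normalisation and exact pooling here are our assumptions for illustration; abliterix may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n = 16, 400   # 400 benign + 400 harmful prompts, per the recipe above

benign  = rng.normal(size=(n, hidden))        # residuals at the chosen layer (synthetic)
harmful = rng.normal(size=(n, hidden)) + 0.5  # shifted to stand in for refusal-laden text

direction = harmful.mean(axis=0) - benign.mean(axis=0)  # mean (target - benign)
direction /= np.linalg.norm(direction)                  # unit direction for projection
print(direction.shape)  # one global direction of dimension `hidden`
```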

Winning hyperparameters (v5 Trial 78)

vector_scope = "global"
vector_index = 16.31            # layer where refusal direction is extracted

[steering.components."attn.o_proj"]
max_weight = 3.42
max_weight_position = 21.22     # peak strength at layer ≈ 21 / 36
min_weight = 1.63               # 47.6% of max — smooth profile
min_weight_distance = 20.65

[steering.components."mlp.down_proj"]   # EGA on fused 128 × 36 experts
max_weight = 6.74
max_weight_position = 26.69     # peak at layer ≈ 27 / 36 (later than attention)
min_weight = 0.96               # 14.3% of max
min_weight_distance = 20.62

[moe]                            # router-row suppression
n_suppress = 1                   # suppress top-1 safety expert per layer
router_bias = -4.11              # scale = max(0, 1 + bias/10) = 0.589
expert_ablation_weight = 0.0     # pinned off; EGA already handles expert weights

The attention peak sits at layer ≈ 21/36 (mid-stack where the refusal decision still has options) and the EGA peak sits later at layer ≈ 27/36 (after attention has routed harmful intent into the expert path). This stacked mid-to-late pair is a new fingerprint vs gpt-oss-20b, where both peaks sat around layer 18 of 24 (≈ 75% depth).

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-120b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-120b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model uses gpt-oss's harmony chat format. The chat template is bundled (chat_template.jinja).

Hardware note: BF16 weights are ~232 GB on disk. You need at least 232 GB aggregate VRAM (e.g. 4× RTX PRO 6000 96GB, 2× H200 141GB, or 8× A100 40GB with TP) or run via device_map="auto" across GPU + CPU with offloading. For faster inference, a GGUF quantised variant (see below) is recommended for single-GPU setups.

vLLM

vllm serve wangzhang/gpt-oss-120b-abliterated \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --enforce-eager

Honest limitations

  • Refusal is low, not zero. 26 / 100 held-out prompts still refuse. The residual refusers cluster around extremely-specific CBRN synthesis and CSAM-adjacent content — exactly where refusal is represented by multiple redundant circuits that partial abliteration cannot all knock out in one Optuna-TPE pass.
  • English > Chinese. Steering vectors came from a primarily English-weighted dataset. Chinese hard prompts mostly work (4/5 on manual Chinese tests gave real compliance; 1/5 drifted into a food-metaphor pun on "制作炸弹" (bomb-making) → "炸盘" (a fried dish)). Bypass quality on Chinese is slightly lower: shorter responses, occasional English fallback on technical terms.
  • Weaker than gpt-oss-20b-abliterated on ASR headline. 20b shipped at 94% ASR (6/100 refusals, KL 0.0098). 120b ships at 74% ASR (26/100 refusals, KL 5.4e-06). The 120b model has much lower KL (base behaviour is more preserved) but higher residual refusal — a property of 120b's 128-expert router being a much wider, more redundant safety surface than 20b's 32-expert router.
  • Occasional long-form derail. On generations past ~400 tokens a small fraction of outputs drift into markdown-table loops; this is an abliteration side-effect, not a base-model regression.

Reproducibility

Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under configs/gpt_oss_120b.toml + checkpoints_gpt_oss_120b_v5/. To reproduce from scratch on a 4×96GB Blackwell pod:

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# One-time pre-dequant: MXFP4 → BF16 on disk (~8 min, 232 GB output)
python scripts/prepare_bf16_checkpoint.py \
    --model openai/gpt-oss-120b \
    --out /workspace/gpt-oss-120b-bf16

# Point config at the BF16 checkpoint and launch
sed -i 's|model_id = "openai/gpt-oss-120b"|model_id = "/workspace/gpt-oss-120b-bf16"|' \
    configs/gpt_oss_120b.toml

bash quick_start/deploy_gpt_oss_120b.sh
# 100 trials, ~100 min wall time on 4× RTX PRO 6000

Optuna is deterministic if you set sampler_seed in [optimization].

Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours at scale (128 experts × 36 layers is large enough to show genuine expert specialisation rather than router noise). Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.

Acknowledgments

  • openai/gpt-oss-120b for the base model
  • abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
  • TrevorS for the original Expert-Granular Abliteration formulation
  • vLLM team for the collective_rpc + reset_prefix_cache APIs that made in-place TP editing practical

Author: wangzhang

Likes: 4

Downloads: 30

Tags: transformers, safetensors, gpt_oss, text-generation, abliterated, uncensored, moe, gpt-oss, mxfp4, direct-steering, ega, moe-router-suppression, vllm-in-place-editing, abliterix, conversational, en, zh, base_model:openai/gpt-oss-120b, base_model:finetune:openai/gpt-oss-120b, license:apache-2.0, endpoints_compatible, region:us

mudler/Qwen3.6-35B-A3B-uncensored-heretic-APEX-GGUF

Author: mudler

Likes: 3

Downloads: 0

Tags: gguf, endpoints_compatible, region:us, conversational

0xSero/GLM-5.1-478B-A42B-REAP-NVFP4


license: mit
tags:

  • glm
  • moe
  • reap
  • nvfp4
  • sglang
  • blackwell

library_name: sglang
base_model:

  • zai-org/GLM-5.1

pipeline_tag: text-generation

GLM-5.1-478B-A42B-REAP-NVFP4

NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.

| | |
|---|---|
| Total params | 478.4B |
| Activated / token | 42.7B |
| Routed experts / MoE layer | 160 (was 256 in base) |
| Active experts / token | 8 routed + 1 shared |
| Layers | 78 (3 dense + 75 MoE) + 1 MTP / NEXTN |
| Hidden size | 6144 |
| Attention | MLA-DSA, 64 heads |
| Max position | 202,752 |
| Quantization | NVFP4, group_size=16 (modelopt_fp4) |
| On-disk size | 285 GB (85 shards) |
| License | MIT (inherited from GLM-5.1) |

Measured performance

Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig:

| Context | tok/s |
|---|---|
| 256 | 46.5 |
| 4 k | 41.8 |
| 16 k | 38.6 |
| 150 k | 22.4 |

Under live mixed traffic (1,495 decode samples):

| Context range | p50 tok/s |
|---|---|
| < 1 k | 42.7 |
| 1 – 8 k | 44.3 |
| 8 – 32 k | 36.3 |
| 32 – 100 k | 27.7 |

Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.


Quick start (4× 96 GB Blackwell)

# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4

# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)

# 4. Launch
./launch.sh   # see full script below

Reference rig

  • 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
  • NVIDIA driver 580.126.18, CUDA 12.9 userspace
  • Ubuntu / Pop!_OS 22.04, Python 3.12

This is what the tuning targets. The same recipe works on 4× B200 (sm_100), on 8× Hopper (sm_90) possibly with more aggressive quantization, and on other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.


Exact versions (pinned from the running venv)

Everything below is reproducible from:

pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

The resolver pulls in the whole stack at these versions:

sglang                   0.5.10.post1
torch                    2.9.1+cu129
triton                   3.5.1
transformers             5.3.0
tokenizers               0.22.2
safetensors              0.8.0rc0
numpy                    2.4.4

flashinfer-python        0.6.7.post3
flashinfer-cubin         0.6.7.post3
nvidia-cutlass-dsl       4.5.0.dev0
nvidia-cublas-cu12       12.9.1.4
nvidia-cudnn-cu12        9.10.2.21
nvidia-nccl-cu12         2.27.5
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.20

Verify:

python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3

Required sglang patch (SM120 only)

GLM-5.1's config advertises GlmMoeDsaForCausalLM, which sglang routes through DeepSeek Sparse Attention by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5 through the stable dense-MLA path by excluding it from the NSA architecture list:

Edit <venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py, function is_deepseek_nsa():

def is_deepseek_nsa(config) -> bool:
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    # Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )

(Only the architectures list changes — GlmMoeDsaForCausalLM is removed.)

After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: attention_backend='triton'.

On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.


Launch

#!/usr/bin/env bash
set -euo pipefail

MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv

# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3        # four Blackwell GPUs

# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1

# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1

exec "$VENV/bin/python" -m sglang.launch_server \
  --model-path        "$MODEL" \
  --served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --context-length    202752 \
  --max-running-requests 1 \
  --mem-fraction-static 0.94 \
  --chunked-prefill-size 4096 \
  --page-size         128 \
  --quantization      modelopt_fp4 \
  --kv-cache-dtype    fp8_e4m3 \
  --triton-attention-num-kv-splits 64 \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend  flashinfer_cudnn \
  --cuda-graph-max-bs 4 \
  --pre-warm-nccl \
  --tool-call-parser  glm47 \
  --reasoning-parser  glm45 \
  --chat-template     "$MODEL/chat_template.jinja" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --watchdog-timeout  1800

On sm_90 / sm_100 you will want --attention-backend flashinfer and --fp4-gemm-backend b12x instead — see the 555B sibling card for that recipe.


Key flag decisions (why these specific values)

These were measured on the reference rig; defaults were not.

--triton-attention-num-kv-splits 64 — biggest single win. Default is 8. At bs=1 decode on sm_120, raising kv-splits gave:

| Context | splits=8 | splits=64 |
|---|---|---|
| 4 k | 39.7 | 41.8 |
| 16 k | 26.4 | 38.6 |
| 150 k | 5.2 | 22.4 |

Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.

--mem-fraction-static 0.94 — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and ~5 GB/rank of headroom for graph recapture and prefill scratch.

--kv-cache-dtype fp8_e4m3 — halves KV memory vs bf16. Required to fit 202 k context in budget.

--attention-backend is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics not available on the workstation board.

--page-size 128 — the non-MTP default. Drop to 64 only if enabling speculative decode.


MTP / NEXTN speculative decode (optional)

The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.

| | Without MTP (this page's default) | With MTP |
|---|---|---|
| Decode tok/s (short) | ~46 | ~90 (1.93×) |
| Max context | 202,752 | ~65,536 |
| KV dtype | fp8_e4m3 | bf16 (required by NEXTN) |
| Page size | 128 | 64 (required by NEXTN) |

MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable with:

# Replace three lines in the launch script:
--context-length 65536 \
--page-size 64 \
--kv-cache-dtype auto \
# and add:
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-attention-mode decode \
--speculative-moe-runner-backend cutlass

Also drop --mem-fraction-static to 0.88 — the draft worker adds ~5 GB/rank.


Sampling recommendations

General chat / reasoning:

temperature=0.5  top_p=0.95  frequency_penalty=0.3  repetition_penalty=1.05

Strict-answer (MCQ, tool-use benchmarks):

temperature=0.0  repetition_penalty=1.05

Keep repetition_penalty=1.05 everywhere. Pure greedy with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).
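A hedged example of how these presets map onto a request against the OpenAI-compatible endpoint started above (note that repetition_penalty is an sglang extension to the standard OpenAI schema):

```python
import json

# Sampling presets from this card, expressed as request parameters.
PRESETS = {
    "chat":   {"temperature": 0.5, "top_p": 0.95,
               "frequency_penalty": 0.3, "repetition_penalty": 1.05},
    "strict": {"temperature": 0.0, "repetition_penalty": 1.05},
}

payload = {
    "model": "GLM-5.1-478B-A42B-REAP-NVFP4",
    "messages": [{"role": "user", "content": "Summarise REAP pruning in one line."}],
    **PRESETS["chat"],
}
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions (server from the launch script)
```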


Lineage & license

zai-org/GLM-5.1  (official, 744B bf16, 256 experts, MIT)
    │
    ├── community NVFP4 quantization via NVIDIA Model Optimizer
    │     (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
    │
    ├── Local REAP pass 1: 256 → 192 experts
    │     0xSero/GLM-5.1-555B-A14B-REAP-NVFP4
    │
    └── Local REAP pass 2: 192 → 160 experts
          0xSero/GLM-5.1-478B-A42B-REAP-NVFP4   ← this model

Both REAP passes were done locally using pooled token-weighted observations from:

Prune scripts and MTP-stitch script are in the repo tree.

License: MIT, inherited from zai-org/GLM-5.1.

Citation (REAP method):

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv},
}
<!-- GLM51_FAMILY_COMPAT_START -->

GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.

Quick picker

| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |

Full family

| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang --quantization modelopt_fp4 | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang --quantization modelopt_fp4 | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14B | 4× 80 GB | vllm / sglang --quantization gptq_marlin | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |

Notes

  • NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
  • NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
  • NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer — PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
  • GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16-element FP8 scales; GPTQ is group-quantized int4 with broader engine support.
  • REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
  • Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.

Pointer to active inference recipe

See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.

Citation

@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}
<!-- GLM51_FAMILY_COMPAT_END -->

Author: 0xSero

Likes: 3

Downloads: 0

Tags: sglang, safetensors, glm_moe_dsa, glm, moe, reap, nvfp4, blackwell, text-generation, conversational, arxiv:2510.13999, base_model:zai-org/GLM-5.1, base_model:quantized:zai-org/GLM-5.1, license:mit, 8-bit, modelopt, region:us

BeaverAI/Artemis-31B-v1h-GGUF

We're back!

this might be it?


  • gemma 4 31b
  • supports thinking


Author: BeaverAI

Likes: 3

Downloads: 0

Tags: gguf, endpoints_compatible, region:us, conversational

inclusionAI/DR-Venus-4B-RL

DR-Venus-4B-RL

DR-Venus-4B-RL is the reinforcement-learned DR-Venus checkpoint built on top of inclusionAI/DR-Venus-4B-SFT. It is a 4B deep research agent designed for long-horizon web research with explicit tool use, evidence collection, and answer generation.

This model is trained entirely on open data. Starting from the SFT checkpoint, DR-Venus-4B-RL applies long-horizon agentic RL with IGPO-style information gain rewards and format-aware turn-level supervision to improve execution reliability under long tool-use trajectories.

What This Model Is For

This checkpoint is intended for:

  • long-horizon deep research with tool-augmented reasoning
  • improving execution reliability beyond supervised imitation
  • evidence-grounded answering with search and visit
  • deployment in the official DR-Venus inference pipeline

It is not primarily optimized for:

  • plain chat without tools
  • generic short-context instruction following
  • use cases that do not need multi-step retrieval and browsing

Model Details

  • Base model: Qwen/Qwen3-4B-Thinking-2507
  • Initialization checkpoint: inclusionAI/DR-Venus-4B-SFT
  • Training stage: agentic reinforcement learning
  • Training framework: verl + IGPO algorithm
  • Tool setting: search + visit
  • Maximum rollout horizon: 200 interaction steps
  • Maximum rollout context length: 256K
  • Intended domain: long-horizon open-domain research and evidence-grounded question answering

How DR-Venus Builds RL Supervision

DR-Venus-4B-RL is trained with dense turn-level supervision tailored to deep research:

  1. The model starts from the DR-Venus supervised checkpoint.
  2. For each query, the agent interacts with the environment over multi-turn search and visit trajectories.
  3. IGPO uses information gain rewards to measure whether an intermediate turn increases the model's probability of producing the ground-truth answer.
  4. Information gain rewards are combined with outcome rewards and turn-level format-aware penalties.
  5. The policy is optimized using an IGPO objective with fine-grained credit assignment, specifically tailored for the long-horizon nature of deep research rollouts.

This design improves supervision density, credit assignment, and data efficiency compared with sparse trajectory-level RL alone.
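As a rough illustration of steps 3 and 4 above, a turn-level reward can be sketched as an information-gain term combined with outcome and format terms. The function names, the log-probability form of the gain, and the additive combination are illustrative assumptions, not the paper's exact formulation:

```python
import math

def information_gain(p_answer_before: float, p_answer_after: float) -> float:
    """Illustrative IG of one turn: increase in log-probability of the
    ground-truth answer after the turn's tool interaction is added to
    the context (assumed form, not the paper's definition)."""
    return math.log(p_answer_after) - math.log(p_answer_before)

def turn_reward(p_before: float, p_after: float,
                outcome: float, format_ok: bool,
                format_penalty_scale: float = 1.0) -> float:
    """Combine IG with the outcome reward and a format-aware penalty.
    The additive combination is an assumption for illustration."""
    reward = information_gain(p_before, p_after) + outcome
    if not format_ok:
        reward -= format_penalty_scale  # turn-level format penalty
    return reward
```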

Training Data

This model is trained from open-data supervision constructed from:

In the current paper setup:

  • RL is performed entirely on open query-answer pairs
  • rollout groups are sampled with long-horizon agent interaction
  • generation is performed with up to 200 interaction steps per query

For more implementation details, please refer to the DR-Venus GitHub repository.

Training Recipe

The RL checkpoint is trained with the following setup reported in the current paper draft:

  • algorithm: IGPO-style agentic RL
  • rollout group size: 8
  • training batch size: 16
  • learning rate: 1e-6
  • rollout temperature: 1.0
  • rollout top-p: 0.95
  • maximum context length: 256K
  • maximum generation length per turn: 8,192
  • discount factor: 0.95
  • format penalty scale: 1.0
  • training framework: verl with vLLM rollout engine and FSDP trainer

The current paper configuration also enables browse-aware IG assignment and IG-scale style reward balancing.

Evaluation Summary

DR-Venus-4B-RL improves over the SFT checkpoint on most tracked deep research benchmarks and sets a stronger small-model frontier.

Results Against Open Models Under 9B

| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| DeepDive-9B-SFT | 5.6 | 15.7 | -- | 35.0 | -- | -- |
| DeepDive-9B-RL | 6.3 | 15.1 | -- | 38.0 | -- | -- |
| WebSailor-7B | 6.7 | 14.2 | 37.9 | 34.3 | -- | -- |
| OffSeeker-8B-SFT | 10.6 | 24.2 | 47.6 | 48.0 | -- | -- |
| OffSeeker-8B-DPO | 12.8 | 26.6 | 51.5 | 49.0 | -- | -- |
| WebExplorer-8B-RL | 15.7 | 32.0 | 50.0 | 53.7 | 23.0 | 17.8 |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |

Relative to the SFT checkpoint, DR-Venus-4B-RL improves:

  • BrowseComp by +2.3
  • BrowseComp-ZH by +2.0
  • xBench-DS-2505 by +5.7
  • xBench-DS-2510 by +5.4
  • DeepSearchQA by +1.9

These gains are associated with better formatting accuracy, more reliable tool use, and stronger long-horizon execution stability.

Usage

This checkpoint should be used with the official DR-Venus inference pipeline.

git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt

# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh

For reproducing RL training or understanding the rollout setup, see the RL directory in the official repository.

License and Release Notes

Please verify license compatibility with:

  • the upstream base model
  • the released supervision data
  • the external tools and judge models used in training or evaluation

This section can be updated later with the final project-specific license statement.

Citation

If you use this checkpoint, please cite the DR-Venus project.

@misc{dr_venus_2026,
  title  = {DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
  author = {Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yucheng Song, Guoqing Wang, Xiaofeng Wu, Yuqi Zhou, Shuo Yang, Zhenzhe Ying, Zhanwei Zhang, Changhua Meng, Weiqiang Wang},
  year   = {2026},
}

Links

Author: inclusionAI

Likes: 3

Downloads: 0

Tags: safetensors, qwen3, region:us

inclusionAI/DR-Venus-4B-SFT

DR-Venus-4B-SFT

DR-Venus-4B-SFT is a 4B deep research agent obtained by fine-tuning Qwen/Qwen3-4B-Thinking-2507 on cleaned open-data agent trajectories. It is the supervised initialization checkpoint of DR-Venus and is designed to establish stable long-horizon agentic behavior, including reasoning, tool use, evidence collection, and final answer synthesis.

Instead of relying on proprietary traces, DR-Venus-4B-SFT is trained entirely on open REDSearcher trajectories after environment alignment, structural cleaning, correctness filtering, and turn-aware resampling.

What This Model Is For

This checkpoint is intended for:

  • deep research agents with long-horizon tool use
  • open-domain information seeking with search and visit tools
  • initializing stronger agentic checkpoints before RL
  • deployment in the official DR-Venus inference pipeline

It is not primarily optimized for:

  • plain chat without tool use
  • generic instruction-following benchmarks
  • short-context QA without external retrieval

Model Details

  • Base model: Qwen/Qwen3-4B-Thinking-2507
  • Model type: long-context reasoning model for tool-augmented deep research
  • Training stage: agentic supervised fine-tuning
  • Training framework: verl
  • Tool setting: search + visit
  • Maximum training length: 200K
  • Intended domain: long-horizon web research and evidence-grounded question answering

How DR-Venus Builds SFT Data

DR-Venus-4B-SFT is trained on cleaned trajectories built from open REDSearcher SFT trajectories:

  1. Raw trajectories are converted into the same interaction format used by the DR-Venus inference pipeline.
  2. Tool calls are standardized so that training and deployment share the same search / visit protocol.
  3. Disallowed tools and duplicate tool-call turns are removed.
  4. Structurally valid trajectories are filtered by final-answer correctness.
  5. Long-horizon trajectories are upweighted through turn-aware resampling.

This pipeline is designed to improve both data quality and effective data utilization for a small deep research agent.
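The five-step pipeline above can be sketched as a filter-then-resample pass. The trajectory fields, the turn threshold, and the duplication-based upweighting rule are illustrative assumptions; the actual resampling rule is not specified here:

```python
def clean_and_resample(trajectories, long_horizon_turns=20, upweight=2):
    """Illustrative sketch of the SFT data pipeline: drop structurally
    invalid or incorrect trajectories, then upweight long-horizon ones
    by duplication (field names and thresholds are assumptions)."""
    kept = []
    for t in trajectories:
        if not t.get("structurally_valid", False):
            continue  # steps 1-3: format alignment, protocol cleaning, dedup
        if not t.get("answer_correct", False):
            continue  # step 4: final-answer correctness filtering
        # step 5: turn-aware resampling, modeled here as simple duplication
        copies = upweight if t.get("num_turns", 0) >= long_horizon_turns else 1
        kept.extend([t] * copies)
    return kept
```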

Training Data

This model is trained from cleaned open-data supervision constructed from:

In the current paper instantiation, this process yields:

  • 10,001 raw trajectories
  • 9,365 correctness-filtered trajectories
  • 18,745 final SFT training instances after resampling

For more details, please refer to the DR-Venus GitHub repository.

Training Recipe

The SFT checkpoint is trained with the following setup reported in the current paper draft:

  • epochs: 1
  • global batch size: 32
  • micro batch size per GPU: 1
  • learning rate: 1e-5
  • maximum training length: 200K
  • sequence parallel size: 8
  • training framework: verl FSDP trainer
  • supervision format: multi-turn agent trajectories with assistant-token loss masking

Evaluation Summary

DR-Venus-4B-SFT establishes a strong 4B baseline on multiple deep research benchmarks.

Results Against Open Models Under 9B

| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| DeepDive-9B-SFT | 5.6 | 15.7 | -- | 35.0 | -- | -- |
| DeepDive-9B-RL | 6.3 | 15.1 | -- | 38.0 | -- | -- |
| WebSailor-7B | 6.7 | 14.2 | 37.9 | 34.3 | -- | -- |
| OffSeeker-8B-SFT | 10.6 | 24.2 | 47.6 | 48.0 | -- | -- |
| OffSeeker-8B-DPO | 12.8 | 26.6 | 51.5 | 49.0 | -- | -- |
| WebExplorer-8B-RL | 15.7 | 32.0 | 50.0 | 53.7 | 23.0 | 17.8 |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |

Among open models under 9B, DR-Venus-4B-SFT is already highly competitive and outperforms previously reported small agents on most tracked benchmarks. It also serves as the initialization checkpoint used for DR-Venus-4B-RL.

Usage

This checkpoint is intended to be used with the official DR-Venus inference pipeline, which provides the expected system prompt, tool protocol, and long-horizon rollout loop.

git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt

# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh

If you use this checkpoint outside the official DR-Venus codebase, make sure your runtime matches the DR-Venus tool schema and message formatting for search, visit, <tool_call>, and <tool_response>.
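For illustration, message builders matching that schema might look like the sketch below. The tag placement, the JSON argument shape, and the role assigned to tool responses are assumptions inferred from the tag names, not a verified spec; check the official pipeline before relying on this format.

```python
import json

def tool_call_message(name: str, arguments: dict) -> dict:
    """Assistant turn carrying a <tool_call> block.
    Payload shape ({"name": ..., "arguments": ...}) is an assumption."""
    payload = json.dumps({"name": name, "arguments": arguments})
    return {"role": "assistant", "content": f"<tool_call>{payload}</tool_call>"}

def tool_response_message(result: str) -> dict:
    """Follow-up turn wrapping tool output in <tool_response> tags.
    The "user" role here is an assumption."""
    return {"role": "user", "content": f"<tool_response>{result}</tool_response>"}
```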

License and Release Notes

Please verify license compatibility with:

  • the upstream base model
  • the released training data
  • the external tools and benchmarks used in your downstream setup

This section can be updated later with the final project-specific license statement.

Citation

If you use this checkpoint, please cite the DR-Venus project.

@misc{dr_venus_2026,
  title  = {DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
  author = {Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yucheng Song, Guoqing Wang, Xiaofeng Wu, Yuqi Zhou, Shuo Yang, Zhenzhe Ying, Zhanwei Zhang, Changhua Meng, Weiqiang Wang},
  year   = {2026},
}

Links

Author: inclusionAI

Likes: 3

Downloads: 0

Tags: safetensors, qwen3, region:us