Today's AI Summary

AI Developments: GroveMoE Model Excels in Benchmarks, Research Explores MLLM Control

Here's a look at the latest AI models and research papers, focusing on key advancements and potential impact.

Research Highlights

  • Mammogram VQA with GPT-5: A paper titled "Is ChatGPT-5 Ready for Mammogram VQA?" evaluates GPT-5's performance on mammogram visual question answering tasks. While GPT-5 showed improvements over previous models, it still lags behind human experts and domain-specific fine-tuned models, indicating the need for further domain adaptation for high-stakes clinical applications.
  • Controlling Multimodal LLMs: Research explores controlling Multimodal Large Language Models (MLLMs) through reward-guided decoding. The method builds reward models for visual grounding, allowing users to dynamically trade off object precision for recall in image captioning tasks. The approach outperforms existing hallucination mitigation methods.
  • Efficient Reasoning in LLMs: A paper introduces the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF) to improve the efficiency of large language models (LLMs) on complex reasoning tasks. DR. SAF enables models to dynamically assess and adjust their reasoning depth in response to problem complexity, achieving a 49.27% reduction in total response tokens with minimal loss in accuracy.
  • Data Mixture Re-weighting for Language Models: A study proposes using Bayesian Optimization to determine the optimal data mixture for large language model training. The approach demonstrates strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on the largest experiments.
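The reward-guided decoding idea above can be sketched as reranking candidate continuations by LM likelihood plus a weighted visual-grounding reward. A minimal sketch, assuming a scalar trade-off weight `lam`; the function names and weighting scheme are illustrative, not taken from the paper:

```python
def reward_guided_scores(lm_logprobs, grounding_rewards, lam=0.5):
    """Score candidates as LM log-probability plus a weighted grounding
    reward. Raising `lam` (illustrative parameter) favors precisely
    grounded captions; lowering it favors recall/fluency."""
    return [lp + lam * r for lp, r in zip(lm_logprobs, grounding_rewards)]

def pick(candidates, lm_logprobs, grounding_rewards, lam=0.5):
    # select the candidate with the highest combined score
    scores = reward_guided_scores(lm_logprobs, grounding_rewards, lam)
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```

Varying `lam` at decode time is what lets a user move along the precision/recall trade-off without retraining.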

Model Spotlight

  • GroveMoE-Inst: This model introduces a new sparse architecture using adjugate experts for dynamic computation allocation. With 33B parameters, it activates only 3.14–3.28B per token. GroveMoE-Inst demonstrates strong performance across various benchmarks, including MMLU-Pro, SuperGPQA, and OlympiadBench, outperforming models like Llama4-Scout and Qwen3-30B-A3B.
  • AuriStream-1B: AuriStream is a biologically inspired, GPT-style autoregressive Transformer trained to predict tokens from the speech stream. It uses a long context window (~20 s, ~4096 tokens) and is trained on LibriLight (~60k hours) for 500k steps.

Key Takeaways

  • Specialization Still Matters: While general LLMs are improving, domain-specific fine-tuning remains crucial for high-stakes applications like medical imaging.
  • Efficiency is a Key Focus: Research is actively exploring methods to improve the efficiency of LLMs, addressing concerns about computational cost and real-time performance.
  • Novel Architectures Emerge: GroveMoE showcases a novel approach to sparse architectures, achieving impressive performance with a fraction of activated parameters.
  • Multimodal Control is Gaining Traction: The ability to control MLLMs through reward-guided decoding opens up new possibilities for adapting these models to diverse user needs.

AI Papers for 2026-04-13

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum: the agent is compelled to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
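Read at the level of the abstract, the decoupled two-channel advantage could be sketched as below. This is a hedged reading, not the authors' code: the key property is that the efficiency advantage is defined only within accurate trajectories, so penalizing tool use can never trade off against correctness.

```python
def hdpo_advantages(correct, tool_calls):
    """Sketch of decoupled advantage channels (assumed formulation).
    `correct` is a per-trajectory 0/1 accuracy reward and `tool_calls`
    a per-trajectory tool-use count within one sampled group."""
    n = len(correct)
    mean_acc = sum(correct) / n
    # accuracy channel: plain group-baseline advantage
    acc_adv = [c - mean_acc for c in correct]

    # efficiency channel: defined ONLY over accurate trajectories, so the
    # tool-use penalty cannot suppress correctness-seeking behavior
    ok = [i for i in range(n) if correct[i] == 1]
    eff_adv = [0.0] * n
    if len(ok) > 1:
        mean_tools = sum(tool_calls[i] for i in ok) / len(ok)
        for i in ok:
            eff_adv[i] = mean_tools - tool_calls[i]  # fewer calls => positive
    return acc_adv, eff_adv
```

Incorrect trajectories receive no efficiency signal at all, which is what induces the "first solve, then economize" curriculum the abstract describes.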

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
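In sketch form, the routing-guided intervention described above amounts to biasing the router scores of previously identified domain experts so that visual inputs still activate them. This is a hedged reading of the summary; `alpha` and the function name are illustrative:

```python
def boost_domain_experts(router_logits, domain_experts, alpha=1.0):
    """Add a fixed bias `alpha` (illustrative hyperparameter) to the router
    scores of identified domain/reasoning experts before top-k selection,
    counteracting the routing divergence induced by image inputs."""
    return [l + (alpha if i in domain_experts else 0.0)
            for i, l in enumerate(router_logits)]
```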

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
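One plausible reading of "forcing the advantage distribution to converge to $\mathcal{N}(0,1)$" is a rank-based inverse-normal (van der Waerden) transform of the group's rewards. The abstract does not specify the exact mapping, so the sketch below is an assumption, using only the standard library:

```python
from statistics import NormalDist

def gaussian_normalize(rewards):
    """Map one group's rewards onto standard-normal scores via ranks
    (van der Waerden scores); tied rewards receive the same score.
    Assumed reading of G^2RPO's distributional matching, not the paper's code."""
    n = len(rewards)
    nd = NormalDist()
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the run of tied rewards and assign the average 1-based rank
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # Phi^{-1}(rank / (n+1)) yields approximately N(0,1)-distributed scores
    # regardless of the raw reward distribution's shape or heavy tails
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]
```

Because only ranks enter the transform, a single extreme reward cannot dominate the gradient, which matches the abstract's heavy-tail robustness claim.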

RewardFlow: Generate Images by Optimizing What You Reward

We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.
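The generic form of a multi-reward Langevin update, ascending the weighted sum of reward gradients plus Gaussian exploration noise, can be sketched in one dimension. RewardFlow applies this in the latent space of a diffusion/flow model; the code below is the textbook sampler, not the paper's implementation:

```python
import math
import random

def langevin_step(x, reward_grads, weights, eta=0.01, rng=random):
    """One Langevin update: drift along the weighted sum of reward
    gradients, plus sqrt(2*eta)-scaled Gaussian noise."""
    drift = sum(w * g for w, g in zip(weights, reward_grads))
    return x + eta * drift + math.sqrt(2 * eta) * rng.gauss(0.0, 1.0)
```

On a 1-D toy reward r(x) = -(x - 3)^2, repeated steps concentrate samples near the reward maximum at x = 3, illustrating how the sampler steers generation toward high-reward regions.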

PSI: Shared State as the Missing Layer for Coherent AI-Generated Instruments in Personal AI Agents

Personal AI tools can now be generated from natural-language requests, but they often remain isolated after creation. We present PSI, a shared-state architecture that turns independently generated modules into coherent instruments: persistent, connected, and chat-complementary artifacts accessible through both GUIs and a generic chat agent. By publishing current state and write-back affordances to a shared personal-context bus, modules enable cross-module reasoning and synchronized actions across interfaces. We study PSI through a three-week autobiographical deployment in a self-developed personal AI environment and show that later-generated instruments can be integrated automatically through the same contract. PSI identifies shared state as the missing systems layer that transforms AI-generated personal software from isolated apps into coherent personal computing environments.
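The shared-state contract can be sketched as a tiny bus where each module publishes its current state and a write-back callback, and any interface (GUI or chat agent) reads or acts through it. The class and method names below are illustrative, not PSI's actual API:

```python
class ContextBus:
    """Minimal sketch of a shared personal-context bus (assumed design).
    Modules publish state plus an optional write-back hook; writes are
    forwarded to the owning module and mirrored in shared state."""
    def __init__(self):
        self._state = {}
        self._writers = {}

    def publish(self, name, state, writer=None):
        # a module registers its current state and write-back affordance
        self._state[name] = dict(state)
        if writer is not None:
            self._writers[name] = writer

    def read(self, name):
        return self._state[name]

    def write(self, name, **updates):
        # synchronized action: notify the owning module, then mirror the
        # update so every interface sees a coherent view
        self._writers[name](**updates)
        self._state[name].update(updates)
```

A later-generated module only needs to call `publish` with the same contract to become reachable from the chat agent, which is the integration property the deployment study highlights.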

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.
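Sparsifying a steering vector by 90-99% amounts, in the simplest form, to keeping only the largest-magnitude dimensions and zeroing the rest. The sketch below assumes plain magnitude top-k selection; the paper selects dimensions via its activation-patching results, so treat this as an illustrative stand-in:

```python
def sparsify_steering_vector(v, keep_frac=0.05):
    """Keep only the largest-magnitude `keep_frac` of dimensions and zero
    the rest (magnitude-based stand-in for the paper's selection)."""
    k = max(1, int(len(v) * keep_frac))
    thresh = sorted((abs(x) for x in v), reverse=True)[k - 1]
    return [x if abs(x) >= thresh else 0.0 for x in v]
```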

ClawBench: Can AI Agents Complete Everyday Online Tasks?

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
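The interception layer's core behavior, letting read traffic through while capturing and blocking only the final write/submission request, can be sketched as a simple request filter. The `submit_markers` heuristic is illustrative and not ClawBench's actual matching logic:

```python
def intercept(request, submit_markers=("submit", "checkout", "apply")):
    """Allow reads through; capture and block a final submission request so
    evaluation has no real-world side effects (sketch, assumed markers)."""
    is_write = request["method"] in ("POST", "PUT", "PATCH")
    is_submit = any(m in request["url"].lower() for m in submit_markers)
    if is_write and is_submit:
        return {"blocked": True, "captured": request}
    return {"blocked": False}
```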

AI Models

lukealonso/MiniMax-M2.7-NVFP4


base_model:

  • MiniMaxAI/MiniMax-M2.7 license: mit

Update 4/12/26 - Calibration data updated, KLD reduced by ~20%. More tuning to follow.

Note: If you're experiencing issues with spurious spaces after punctuation, try downgrading transformers to 0.4.67

Model Description

MiniMax-M2.7-NVFP4 is an NVFP4-quantized version of MiniMaxAI/MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters.

The original model weights were converted from the official FP8 checkpoint to BF16, then quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.
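The E2M1 format underlying NVFP4 represents the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign bit, and one scale per 16-element block maps real weights into that range. A toy round-trip sketch follows; real kernels store the scale itself in FP8 and pack two 4-bit codes per byte, neither of which is modeled here:

```python
import math

E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 magnitudes

def quantize_block_fp4(block, levels=E2M1_LEVELS):
    """Quantize one 16-element block: pick a scale so the block max maps to
    the top code (6.0), then round each value to the nearest level."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = []
    for x in block:
        mag = abs(x) / scale
        code = min(levels, key=lambda l: abs(l - mag))  # nearest representable
        q.append(math.copysign(code, x))
    return scale, q

def dequantize(scale, q):
    return [scale * v for v in q]
```

Because each 16-element block carries its own scale, a single outlier weight only degrades its local block rather than the whole tensor.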

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. All other layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.
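A back-of-envelope estimate shows why quantizing only the experts still pays off. The expert fraction below is an assumption for illustration, not a published figure for MiniMax-M2.7:

```python
total_params = 230e9      # parameter count from the model card
expert_frac = 0.95        # ASSUMED share of weights in MoE expert MLPs
bf16_bytes = 2.0
# 4-bit code plus one FP8 scale byte shared by 16 elements
nvfp4_bytes = 4 / 8 + 1 / 16

baseline_gb = total_params * bf16_bytes / 1e9
quant_gb = (total_params * expert_frac * nvfp4_bytes
            + total_params * (1 - expert_frac) * bf16_bytes) / 1e9
# under these assumptions: ~460 GB in BF16 vs ~146 GB with NVFP4 experts
```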

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate for experts that are routed to only rarely, calibration was run on a vastly larger number of samples than is typical, ensuring broad expert coverage through natural routing alone.
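Why rare experts need many samples can be seen from a toy coverage count under natural top-k routing. Random Gaussian logits stand in for a real router here; the expert/token counts are illustrative:

```python
import random

def topk_route(logits, k=2):
    """Natural top-k routing: a token reaches only its k highest-scoring experts."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

def expert_coverage(num_experts=64, k=2, num_tokens=10_000, seed=0):
    """Count how many calibration tokens each expert sees under natural
    routing (toy router: i.i.d. Gaussian logits per token)."""
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        logits = [rng.gauss(0, 1) for _ in range(num_experts)]
        for e in topk_route(logits, k):
            counts[e] += 1
    return counts
```

Each token updates only k experts' statistics, so the per-expert sample count is roughly k/num_experts of the token budget, which is why broad coverage demands a much larger calibration set than dense-model calibration.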

Calibration dataset

Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.

Quality

(pending)

You should always evaluate against your specific use case.

SGLang

Tested on 2x and 4x RTX Pro 6000 Blackwell.

  Docker Image: voipmonitor/sglang:cu130 (festr, 6 days old, has b12x built-in)             
  Model: lukealonso/GLM-5.1-NVFP4 (434 GB, glm_moe_dsa, 78 layers, 256 experts)   
          
  Launch command:                 
  export OMP_NUM_THREADS=16       
  export SGLANG_ENABLE_SPEC_V2=True                       
  export NVIDIA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8  # 8x Blackwell                   
          
  python -m sglang.launch_server \
    --model-path lukealonso/MiniMax-M2.7-NVFP4 \        
    --served-model-name MiniMax-M2.7 \   
    --reasoning-parser minimax \
    --tool-call-parser minimax-m2 \   
    --tp 2 \
    --enable-torch-compile \
    --trust-remote-code \           
    --quantization modelopt_fp4 \   
    --kv-cache-dtype bf16 \
    --moe-runner-backend b12x \
    --fp4-gemm-backend b12x \       
    --attention-backend flashinfer \
    --enable-pcie-oneshot-allreduce \                    
    --mem-fraction-static 0.85 \                          
    --host 0.0.0.0 --port 5000

vLLM

(pending)

Author: lukealonso

Likes: 22

Downloads: 0

Tags: safetensors, minimax_m2, custom_code, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, license:mit, 8-bit, modelopt, region:us

AesSedai/MiniMax-M2.7-GGUF


base_model:

  • MiniMaxAI/MiniMax-M2.7

Notes

  • 04-12-2026: The Q4_K_M I uploaded seems to have some issues; the PPL / KLD computation was throwing NaN, so I'll remove the model for now and try to get a working quant up tomorrow.

Description

This repo contains specialized MoE quants for MiniMax-M2.7. Because the FFN tensors are huge compared to the rest of the tensors in the model, it should be possible to achieve better quality at a smaller overall size than a comparable naive quantization. To that end, the default quantization type is kept at high quality, while the FFN UP, FFN GATE, and FFN DOWN tensors are quantized more aggressively.

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Q8_0 | 226.43 GiB (8.51 BPW) | Q8_0 | 7.880138 ± 0.060034 | +0.2412% | 0.029715 ± 0.000649 |
| Q5_K_M | 157.23 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 7.871878 ± 0.059897 | +0.1361% | 0.038926 ± 0.000692 |
| IQ4_XS | 101.10 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 8.290674 ± 0.063543 | +5.4635% | 0.128807 ± 0.001070 |
| IQ3_S | 77.86 GiB (2.92 BPW) | Q8_0 / IQ2_S / IQ2_S / IQ3_S | 8.815764 ± 0.067859 | +12.1430% | 0.282740 ± 0.001687 |

(Figures: KLD graph, PPL graph)

Author: AesSedai

Likes: 14

Downloads: 0

Tags: gguf, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, endpoints_compatible, region:us, imatrix, conversational

Jiunsong/supergemma4-26b-abliterated-multimodal


license: gemma base_model:

  • google/gemma-4-26B-A4B-it
  • huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated tags:
  • gemma4
  • mlx
  • multimodal
  • image-text-to-text
  • abliterated
  • uncensored
  • low-refusal
  • tool-use
  • coding
  • logic
  • korean
  • apple-silicon language:
  • en
  • ko pipeline_tag: image-text-to-text library_name: mlx

Support ongoing open-source work: ko-fi.com/jiunsong

SuperGemma4-26B-Abliterated-Multimodal

An aggressively abliterated, low-refusal multimodal Gemma 4 that is faster than the original local baseline and stronger where real users actually feel it: tool use, coding, logic, Korean responses, long-context stability, and image-grounded prompting.

If you want a Gemma 4 multimodal model that feels less filtered, more responsive, and more useful in real local agent workflows, this is the release to start with.

Why people will want this model

  • Built for users who want an uncensored / abliterated Gemma 4 line without sacrificing practical quality
  • Faster direct MLX runtime than the original local multimodal baseline on the same machine
  • Stronger on code, logic, Korean technical prompts, and real-world tool-calling
  • Keeps multimodal capability instead of dropping image understanding to chase text-only speed
  • Better local agent behavior with stronger practical tool-call routing

Headline snapshot

Metric                     Original Local Baseline      SuperGemma Abliterated MM      Gain
------------------------   --------------------------   -----------------------------   ----------------
Overall benchmark          81.0                         84.0                            +3.0
Code                       80.8                         89.0                            +8.2
Logic                      81.0                         85.1                            +4.1
Korean                     78.6                         82.7                            +4.1
Behavioral audit           6 / 8                        8 / 8                           +2 passes
Regression suite           6 / 7                        7 / 7                           +1 pass
API tool-call success      33.3%                        66.7%                           2x better
Prompt speed               181.13 tok/s                 328.11 tok/s                    +81.1%
Generation speed           22.55 tok/s                  49.54 tok/s                     +119.7%
Average elapsed            12.83 s                      4.52 s                          -64.8%

What is better than the original

  • The model is not just less censored. It is also materially more capable in practical use.
  • Code quality is meaningfully stronger, with a large jump in benchmarked coding performance.
  • Logical reasoning and Korean technical answers are both improved.
  • Tool-use behavior is much better in local agent-style prompts, especially for live-search and execute-code style tasks.
  • Direct MLX runtime is substantially faster on the same hardware.
  • Multimodal behavior remains intact, including image/chart label recognition.

Real strengths in practice

This release performs especially well when you want a single multimodal local model for:

  • low-refusal chat and instruction following
  • code generation and coding support
  • agent-style tool selection
  • Korean technical discussion
  • image-grounded Q&A
  • long-context local workflows

Multimodal and context retention

  • Passed chart / OCR-style label extraction checks
  • Passed 10k-context recall checks
  • Preserved stable image-plus-text prompting while improving text-side capability

Tool-use focus

On the same local stack, this model shows a clear improvement in practical tool-call behavior over the original baseline:

  • more reliable web_search routing for live-information prompts
  • more reliable execute_code routing for runnable Python tasks
  • stronger downstream compatibility for local agent workflows

Quick start

Text + image with MLX-VLM

from mlx_vlm import load, generate

model, processor = load("Jiunsong/supergemma4-26b-abliterated-multimodal")

prompt = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image and list any visible labels."},
                {"type": "image", "image": "/absolute/path/to/image.png"},
            ],
        }
    ],
    tokenize=False,
    add_generation_prompt=True,
)

out = generate(
    model,
    processor,
    prompt,
    image="/absolute/path/to/image.png",
    max_tokens=256,
    temperature=0.0,
    verbose=False,
)

print(out.text)

Local server

python -m mlx_lm.server \
  --model Jiunsong/supergemma4-26b-abliterated-multimodal \
  --host 127.0.0.1 \
  --port 8080

Quantized variants

If you want a smaller ready-to-run build, use one of these companion releases:

  • MLX 8bit: Jiunsong/supergemma4-26b-abliterated-multimodal-mlx-8bit
  • MLX 4bit: Jiunsong/supergemma4-26b-abliterated-multimodal-mlx-4bit
  • GGUF 8bit: Jiunsong/supergemma4-26b-abliterated-multimodal-gguf-8bit
  • GGUF 4bit: Jiunsong/supergemma4-26b-abliterated-multimodal-gguf-4bit

Benchmark notes

  • Benchmarks were run locally on the same Apple Silicon machine for baseline vs tuned model comparison.
  • Tool-call API results reflect the current local MLX Gemma 4 serving stack after runtime hardening for malformed Gemma 4 tool-call edge cases.
  • This card intentionally highlights user-visible strengths rather than internal experiment names.

Bottom line

This release is for people who want the rare combination of:

  • multimodal Gemma 4
  • aggressively abliterated / uncensored behavior
  • faster local MLX inference
  • better coding, logic, Korean, and tool-use performance than the original local baseline

That combination is the whole point of this model.

Author: Jiunsong

Likes: 9

Downloads: 0

Tags: mlx, safetensors, gemma4, multimodal, image-text-to-text, abliterated, uncensored, low-refusal, tool-use, coding, logic, korean, apple-silicon, conversational, en, ko, base_model:google/gemma-4-26B-A4B-it, base_model:finetune:google/gemma-4-26B-A4B-it, license:gemma, region:us

ubergarm/MiniMax-M2.7-GGUF


quantized_by: ubergarm pipeline_tag: text-generation base_model: MiniMaxAI/MiniMax-M2.7 base_model_relation: quantized license_name: modified-mit license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE tags:

  • imatrix
  • conversational
  • minimax_m2
  • ik_llama.cpp

ik_llama.cpp imatrix Quantizations of MiniMaxAI/MiniMax-M2.7

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds for CUDA 12.9. Also check for Windows builds by Thireus, which have been built for CUDA 12.8.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to huggingface for hosting all these big quants!

Finally, I really appreciate the support from aifoundry.org so check out their open source RISC-V based solutions!

Quant Collection

Perplexity computed against wiki.test.raw. (lower is "better")

(Figures: perplexity chart, KLD chart)

These two are just test quants for baseline perplexity comparison and are not available for download here:

  • BF16 426.060 GiB (16.003 BPW)
    • PPL over 552 chunks for n_ctx=512 = 7.8743 +/- 0.05993
  • Q8_0 226.431 GiB (8.505 BPW)
    • PPL over 552 chunks for n_ctx=512 = 7.8764 +/- 0.05997

NOTE: The first split file is much smaller on purpose; it only contains metadata, so it's fine!

IQ5_K 157.771 GiB (5.926 BPW)

PPL over 552 chunks for n_ctx=512 = 7.8860 +/- 0.05997

<details> <summary>👈 Secret Recipe</summary>
custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/imatrix-MiniMax-M2.7-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K.gguf \
    IQ5_K \
    128
</details>

smol-IQ4_KSS 108.671 GiB (4.082 BPW)

PPL over 552 chunks for n_ctx=512 = 8.0990 +/- 0.06185

OBSERVATION: Interestingly, the PPL does not look great on this one, but the KLD looks fine. The previous M2.5 also had some "poorly behaved" perplexity results, with 4-ish BPW quants showing "better" than baseline PPL.

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/imatrix-MiniMax-M2.7-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-smol-IQ4_KSS.gguf \
    IQ4_KSS \
    128
</details>

smol-IQ3_KS 87.237 GiB (3.277 BPW)

PPL over 552 chunks for n_ctx=512 = 8.1491 +/- 0.06240

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/imatrix-MiniMax-M2.7-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-smol-IQ3_KS.gguf \
    IQ3_KS \
    128
</details>

IQ2_KS 69.800 GiB (2.622 BPW)

PPL over 552 chunks for n_ctx=512 = 9.0713 +/- 0.07085

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/imatrix-MiniMax-M2.7-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ2_KS.gguf \
    IQ2_KS \
    128
</details>
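Putting the PPL figures above side by side, the relative degradation versus the IQ5_K quant works out as follows (simple arithmetic on the reported means; the error bars are ignored):

```python
baseline = 7.8860  # IQ5_K PPL over 552 chunks, n_ctx=512
quants = {
    "smol-IQ4_KSS": 8.0990,
    "smol-IQ3_KS": 8.1491,
    "IQ2_KS": 9.0713,
}
for name, ppl in quants.items():
    pct = (ppl - baseline) / baseline * 100
    # e.g. "smol-IQ4_KSS: +2.7% PPL vs IQ5_K"
    print(f"{name}: +{pct:.1f}% PPL vs IQ5_K")
```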

Quick Start

# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)

# Download Desired Quant
$ pip install huggingface_hub
$ hf download --local-dir ./MiniMax-M2.7-GGUF/ --include=IQ2_KS/*.gguf ubergarm/MiniMax-M2.7-GGUF

# Multi GPU Full Offload 128k+ context 96GB VRAM!!!
# Note: `-muge`, and the combination of `-vhad` with `-sm graph`, cause gibberish; see the ik_llama.cpp issue in the references
model=MiniMax-M2.7-IQ2_KS-00001-of-00003.gguf
./build/bin/llama-server \
  --model "$model" \
  --alias ubergarm/MiniMax-M2.7 \
  -c 163840 \
  -khad -ctk q8_0 -ctv q6_0 \
  -sm graph \
  -ngl 99 \
  -ub 1024 -b 2048 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --no-mmap

# CPU-Only
# NOTE: -muge causes gibberish, see ik_llama.cpp issue in references
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/MiniMax-M2.7 \
    --ctx-size 65536 \
    --merge-qkv \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

For tool use, you can always bring your own template with --chat-template-file myTemplate.jinja.

Advanced options include self-speculative decoding and using RAM to cache prompts (e.g. 8192 would use 8 GiB of RAM):

  --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4 \
  --cache-ram 8192 \
  --prompt-cache-all

References

Author: ubergarm

Likes: 8

Downloads: 0

Tags: gguf, imatrix, conversational, minimax_m2, ik_llama.cpp, text-generation, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, endpoints_compatible, region:us

JANGQ-AI/MiniMax-M2.7-JANG_2L


license: other
license_name: minimax-open
library_name: mlx
pipeline_tag: text-generation
tags:

  • mlx
  • jang
  • minimax
  • moe
  • apple-silicon

<p align="center"> <img src="mlx-studio-logo.png" alt="MLX Studio" width="400"/> </p> <p align="center"> <img src="jangq-logo.png" alt="JANGQ" width="200"/> </p> <div align="center">

MiniMax-M2.7 JANG_2L

MiniMax M2.7 228B MoE — 2.10-bit mixed precision, 63 GB

Smallest MiniMax M2.7 for Apple Silicon — fits on 96 GB+ Macs.

</div>

Recommended: Run in MLX Studio for the best experience, including thinking mode support and optimized MoE inference.

Important Settings

MiniMax M2.7 is an always-reasoning model. It thinks before answering on every prompt.

| Setting | Value | Notes |
|---------|-------|-------|
| Temperature | 1.0 | REQUIRED — greedy/temp=0 causes infinite thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |

Model Details

| Metric | Value |
|--------|-------|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE |
| Total Parameters | 228.7B |
| Active Parameters | ~1.4B per token |
| Profile | JANG_2L (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=2-bit) |
| Actual avg bits | 2.10 |
| Model size | 63 GB |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| group_size | 128 (speed-optimized for 256 experts) |
| Routing | Sigmoid + bias correction (not softmax) |
| QK-norm | Full vector RMSNorm |
| Context | 192K tokens |

JANG_2L Bit Allocation

| Tier | Components | Bits |
|------|-----------|------|
| CRITICAL | Attention (Q/K/V/O), lm_head | 8 |
| IMPORTANT | Embeddings | 6 |
| COMPRESS | Expert MLP (w1/w2/w3) — 98%+ of params | 2 |
| Passthrough | MoE router/gate (float16), norms, QK-norms | 16 |

JANG protects routing and attention at full precision while compressing the 256 expert MLPs — where MoE models are most tolerant of quantization. The router is kept at float16 (no quantization) for maximum routing precision.
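The tiering described above can be sketched as a name-based lookup. The tensor-name patterns below are hypothetical (typical transformer naming), not the actual JANG matching rules:

```python
def jang_2l_bits(tensor_name: str):
    """Illustrative tier assignment for the JANG_2L profile
    (8/6/2 bits plus float16 passthrough). Name patterns are
    guesses at common tensor names, not JANG's real rules."""
    name = tensor_name.lower()
    if "router" in name or "norm" in name:
        return None  # passthrough: router/gate and norms stay float16
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj", "lm_head")):
        return 8     # CRITICAL: attention + lm_head
    if "embed" in name:
        return 6     # IMPORTANT: embeddings
    return 2         # COMPRESS: expert MLPs (w1/w2/w3), ~98% of params
```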

MMLU Benchmarks (200q, 10 subjects, reasoning ON)

Coming soon — benchmarks in progress.

Why JANG for MiniMax

Standard MLX quantization on MiniMax produces completely broken output at ALL bit levels (~25% MMLU = random guessing). JANG's mixed-precision approach yields the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU vs MLX's 25% (random). M2.7 results pending.

All Quantizations

| Model | Profile | Size | Avg Bits |
|-------|---------|------|----------|
| JANG_2L | (8, 6, 2) | 63 GB | 2.10 |
| JANG_3L | (8, 4, 3) | 89 GB | 3.08 |
| JANG_4M | (8, 4, 4) | 115 GB | 4.06 |
| JANG_6M | (8, 6, 6) | 167 GB | 6.03 |

Requirements

  • Apple Silicon Mac with 96 GB unified memory
  • MLX framework
  • MLX Studio recommended

Tool Use / Agent Mode

MiniMax M2.7 uses interleaved thinking + tool calls — it reasons inside <think> blocks, then emits tool calls in <minimax:tool_call> format. Some clients (Opencode, etc.) may strip the <think> block and miss the tool call.

For tool-use clients, set enable_thinking=False in the chat template:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False  # skips <think> injection for tool-use
)

MiniMax tool call format:

<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>
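A minimal parser for this format might look like the following regex-based sketch (`parse_minimax_tool_calls` is a hypothetical helper, not part of the MiniMax tooling):

```python
import re

def parse_minimax_tool_calls(text: str):
    """Extract tool calls from <minimax:tool_call> blocks.
    Returns a list of {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for block in re.findall(
        r"<minimax:tool_call>(.*?)</minimax:tool_call>", text, re.S
    ):
        for name, body in re.findall(
            r'<invoke name="([^"]+)">(.*?)</invoke>', block, re.S
        ):
            args = dict(re.findall(
                r'<parameter name="([^"]+)">(.*?)</parameter>', body, re.S
            ))
            calls.append({"name": name, "arguments": args})
    return calls
```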

Usage

from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_2L")
sampler = make_sampler(temp=1.0, top_p=0.95)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)

Support

MLX Studio | JANGQ | X @dealignai

Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.


This model is provided for research and personal use. Users are responsible for ensuring their use complies with applicable laws and the MiniMax license.

Author: JANGQ-AI

Likes: 7

Downloads: 0

Tags: mlx, safetensors, minimax_m2, jang, minimax, moe, apple-silicon, text-generation, conversational, custom_code, license:other, region:us

mlx-community/MiniMax-M2.7-4bit-mxfp4


pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
library_name: mlx
base_model: MiniMaxAI/MiniMax-M2.7
tags:

  • mlx

mlx-community/MiniMax-M2.7-4bit-mxfp4

This model mlx-community/MiniMax-M2.7-4bit-mxfp4 was converted to MLX format from MiniMaxAI/MiniMax-M2.7 using mlx-lm version 0.31.3.

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/MiniMax-M2.7-4bit-mxfp4")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: mlx-community

Likes: 7

Downloads: 0

Tags: mlx, safetensors, minimax_m2, text-generation, conversational, custom_code, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, license:other, 4-bit, region:us

Comfy-Org/Gemma4


license: apache-2.0
tags:

  • diffusion-single-file
  • comfyui

work in progress, for PR https://github.com/Comfy-Org/ComfyUI/pull/13376

Author: Comfy-Org

Likes: 6

Downloads: 0

Tags: diffusion-single-file, comfyui, license:apache-2.0, region:us

JANGQ-AI/MiniMax-M2.7-JANG_3L


license: other
license_name: minimax-m2.7-non-commercial
license_link: LICENSE
library_name: mlx
pipeline_tag: text-generation
tags:

  • mlx
  • jang
  • minimax
  • moe
  • apple-silicon

⚠️ Requires MLX Studio to run. Standard mlx_lm cannot load mixed-precision JANG models. MLX Studio includes the JANG loader with automatic per-layer bit detection.

Follow us: X @dealignai

<p align="center"> <img src="mlx-studio-logo.png" alt="MLX Studio" width="400"/> </p> <p align="center"> <img src="jangq-logo.png" alt="JANGQ" width="200"/> </p> <div align="center">

MiniMax-M2.7 JANG_3L

MiniMax M2.7 228B MoE — 3.08-bit mixed precision, 89 GB

Best balance of quality and size — fits on 128 GB+ Macs.

</div>

Recommended: Run in MLX Studio for the best experience, including thinking mode support and optimized MoE inference.

Important Settings

MiniMax M2.7 is an always-reasoning model. It thinks before answering on every prompt.

| Setting | Value | Notes |
|---------|-------|-------|
| Temperature | 1.0 | REQUIRED — greedy/temp=0 causes infinite thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |

Model Details

| Metric | Value |
|--------|-------|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE |
| Total Parameters | 228.7B |
| Active Parameters | ~1.4B per token |
| Profile | JANG_3L (CRITICAL=8-bit, IMPORTANT=4-bit, COMPRESS=3-bit) |
| Actual avg bits | 3.08 |
| Model size | 89 GB |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| group_size | 128 (speed-optimized for 256 experts) |
| Routing | Sigmoid + bias correction (not softmax) |
| QK-norm | Full vector RMSNorm |
| Context | 192K tokens |

JANG_3L Bit Allocation

| Tier | Components | Bits |
|------|-----------|------|
| CRITICAL | Attention (Q/K/V/O), lm_head | 8 |
| IMPORTANT | Embeddings | 4 |
| COMPRESS | Expert MLP (w1/w2/w3) — 98%+ of params | 3 |
| Passthrough | MoE router/gate (float16), norms, QK-norms | 16 |

JANG protects routing and attention at full precision while compressing the 256 expert MLPs — where MoE models are most tolerant of quantization. The router is kept at float16 (no quantization) for maximum routing precision.

MMLU Comparison — All JANG Profiles (200q, reasoning ON)

| Subject | JANG_2L (63 GB) | JANG_3L (89 GB) | JANG_4M (115 GB) | JANG_6M (167 GB) |
|---------|:-:|:-:|:-:|:-:|
| Abstract Algebra | 16/20 | 19/20 | 19/20 | — |
| Anatomy | 17/20 | 18/20 | 20/20 | — |
| Astronomy | 19/20 | 19/20 | 19/20 | — |
| College CS | 17/20 | 19/20 | 19/20 | — |
| College Physics | 16/20 | 20/20 | 20/20 | — |
| HS Biology | 19/20 | 20/20 | 19/20 | — |
| HS Chemistry | 16/20 | 19/20 | 19/20 | — |
| HS Mathematics | 18/20 | 20/20 | 20/20 | — |
| Logical Fallacies | 19/20 | 19/20 | 18/20 | — |
| World Religions | 19/20 | 18/20 | 18/20 | — |
| TOTAL | 176/200 (88.0%) | 191/200 (95.5%) | 191/200 (95.5%) | ≥95.5% |
| GPU RAM | 62.6 GB | 88.6 GB | 114.8 GB | 167.2 GB |

JANG_6M not benchmarked due to slow generation (~20 tok/s). Near-lossless 6-bit expected to match or exceed 4M/3L.

Why JANG for MiniMax

Standard MLX quantization on MiniMax produces completely broken output at ALL bit levels (~25% MMLU = random guessing). JANG's mixed-precision approach yields the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU vs MLX's 25% (random). M2.7 results pending.

All Quantizations

| Model | Profile | Size | Avg Bits |
|-------|---------|------|----------|
| JANG_2L | (8, 6, 2) | 63 GB | 2.10 |
| JANG_3L | (8, 4, 3) | 89 GB | 3.08 |
| JANG_4M | (8, 4, 4) | 115 GB | 4.06 |
| JANG_6M | (8, 6, 6) | 167 GB | 6.03 |

Requirements

  • Apple Silicon Mac with 128 GB unified memory
  • MLX framework
  • MLX Studio recommended

Tool Use / Agent Mode

MiniMax M2.7 uses interleaved thinking + tool calls — it reasons inside <think> blocks, then emits tool calls in <minimax:tool_call> format. Some clients (Opencode, etc.) may strip the <think> block and miss the tool call.

For tool-use clients, set enable_thinking=False in the chat template:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False  # skips <think> injection for tool-use
)

MiniMax tool call format:

<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>

Usage

from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_3L")
sampler = make_sampler(temp=1.0, top_p=0.95)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)

Support

MLX Studio | JANGQ | X @dealignai

Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.


This model is licensed under the MiniMax M2.7 Non-Commercial License. Commercial use requires prior written authorization from MiniMax (api@minimax.io). See LICENSE file for full terms.

Author: JANGQ-AI

Likes: 6

Downloads: 0

Tags: mlx, safetensors, minimax_m2, jang, minimax, moe, apple-silicon, text-generation, conversational, custom_code, license:other, region:us

TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2-GGUF


base_model: TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2
license: apache-2.0
tags:

  • text-generation-inference
  • transformers
  • unsloth
  • gemma4
  • reasoning

datasets:

  • TeichAI/Claude-Opus-4.6-Reasoning-887x
  • TeichAI/claude-4.5-opus-high-reasoning-250x
  • Crownelius/Opus-4.6-Reasoning-2100x-formatted

🌟 Gemma 4 - 31B x Claude Opus 4.6 v2

Build Environment & Features:

  • Fine-tuning Framework: Unsloth
  • Reasoning Effort: High
  • This model bridges the gap between Google's exceptional open-weights architecture and Claude 4.6's profound reasoning capabilities, leveraging cutting-edge fine-tuning environments.

Gemma 4 Benchmarks

💡 Model Introduction

Gemma 4 - 31B x Claude Opus 4.6 is a highly capable model fine-tuned on top of the powerful unsloth/gemma-4-31B-it architecture. The model's core directive is to absorb state-of-the-art reasoning distillation, primarily sourced from Claude-4.6 Opus interactions.

By utilizing datasets where the reasoning effort was explicitly set to High, this model excels in breaking down complex problems and delivering precise, nuanced solutions across a variety of demanding domains.

🗺️ Training Pipeline Overview

Base Model (unsloth/gemma-4-31B-it)
 │
 ▼
Supervised Fine-Tuning (SFT) + High-Effort Reasoning Datasets
 │
 ▼
Final Model (Gemma 4 - 31B x Claude Opus 4.6)

📋 Stage Details & Benchmarks

Benchmarks coming soon

Performance vs Size:

Deep Dive Analysis: For more comprehensive insights regarding the base capabilities of the Gemma 4 architecture, please refer to this Analysis Document.

🔹 Supervised Fine-Tuning (Meeting Claude)

  • Objective: To inject high-density reasoning logic and establish a strict format for complex problem-solving.
  • Methodology: We utilized Unsloth for highly efficient memory and compute optimization during the fine-tuning process. The model was trained extensively on various reasoning trajectories from Claude Opus 4.6 to adopt a structured and efficient thinking pattern.

📚 All Datasets Used

The dataset consists of high-quality, high-effort reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| TeichAI/Claude-Opus-4.6-Reasoning-887x | Core Claude 4.6 Opus reasoning trajectories. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | High-intensity reasoning distillation. |
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | Crownelius's extensively formatted Opus reasoning dataset for structural reinforcement. |

🌟 Core Skills & Capabilities

Thanks to its robust base model and high-effort reasoning distillation, this model is highly optimized for the following use cases:

  1. 💻 Coding: Advanced code generation, debugging, and software architecture planning.
  2. 🔬 Science: Deep scientific reasoning, hypothesis evaluation, and analytical problem-solving.
  3. 🔎 Deep Research: Navigating complex, multi-step research queries and synthesizing vast amounts of information.
  4. 🧠 General Purpose: Highly capable instruction-following for everyday tasks requiring high logical coherence.

Getting Started

You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:

pip install -U transformers torch accelerate

Once you have everything installed, you can proceed to load the model with the code below:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

Once the model is loaded, you can start generating output:

# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True, 
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)

To enable reasoning, set enable_thinking=True and the parse_response function will take care of parsing the thinking output.

Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:

<details> <summary>Code for processing Audio</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process audio. To use it, make sure to install the following packages:

pip install -U transformers torch librosa accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:

# Prompt - add audio before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details> <details> <summary>Code for processing Images</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process images. To use it, make sure to install the following packages:

pip install -U transformers torch torchvision accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:

# Prompt - add image before text
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details> <details> <summary>Code for processing Videos</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process videos. To use it, make sure to install the following packages:

pip install -U transformers torch torchvision torchcodec librosa accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:

# Prompt - add video before text
messages = [
    {
        'role': 'user',
        'content': [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {'type': 'text', 'text': 'Describe this video.'}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details>

Best Practices

For the best performance, use these configurations and best practices:

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

  • temperature=1.0
  • top_p=0.95
  • top_k=64

2. Thinking Mode Configuration

Compared to Gemma 3, the models use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:

  • Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
  • Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
    <|channel>thought\n[Internal reasoning]<channel|>
  • Disabled Thinking Behavior: For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
    <|channel>thought\n<channel|>[Final answer]

[!Note] Many libraries, such as Transformers and llama.cpp, handle the complexities of the chat template for you.

3. Multi-Turn Conversations

  • No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

4. Modality order

  • For optimal performance with multimodal inputs, place image and/or audio content before the text in your prompt.

5. Variable Image Resolution

Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

  • The supported token budgets are: 70, 140, 280, 560, and 1120.
    • Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
    • Use higher budgets for tasks like OCR, document parsing, or reading small text.
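Following the guidance above, budget selection can be reduced to a tiny helper. The task labels and defaults below are illustrative only, not part of any Gemma 4 API:

```python
# Supported visual token budgets per the model card.
VALID_BUDGETS = (70, 140, 280, 560, 1120)

def pick_visual_budget(task: str) -> int:
    """Pick a visual token budget: low for coarse tasks
    (classification, captioning, video), high for fine-grained
    tasks (OCR, document parsing, small text)."""
    low_detail_tasks = {"classification", "captioning", "video"}
    return 140 if task in low_detail_tasks else 1120
```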

6. Audio

Use the following prompt structures for audio processing:

  • Audio Speech Recognition (ASR)
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
  • Automatic Speech Translation (AST)
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
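The two templates above can be filled in with a small helper; `asr_prompt` and `ast_prompt` are hypothetical names used here for illustration:

```python
def asr_prompt(language: str) -> str:
    """Build the ASR prompt from the template above."""
    return (
        f"Transcribe the following speech segment in {language} "
        f"into {language} text.\n\n"
        "Follow these specific instructions for formatting the answer:\n"
        "* Only output the transcription, with no newlines.\n"
        "* When transcribing numbers, write the digits, i.e. write 1.7 "
        "and not one point seven, and write 3 instead of three."
    )

def ast_prompt(source: str, target: str) -> str:
    """Build the AST prompt from the template above."""
    return (
        f"Transcribe the following speech segment in {source}, "
        f"then translate it into {target}.\n"
        f"When formatting the answer, first output the transcription in {source}, "
        f"then one newline, then output the string '{target}: ', "
        f"then the translation in {target}."
    )
```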

7. Audio and Video Length

All models support image inputs and can process videos as frames, whereas the E2B and E4B models also support audio inputs. Audio supports a maximum length of 30 seconds. Video supports a maximum of 60 seconds, assuming frames are processed at one frame per second.

🙏 Acknowledgements

  • Google: For providing an exceptional open weights model. Read more about Gemma 4 on the Google Innovation Blog.
  • Unsloth: For assembling ready-to-use, cutting-edge fine-tuning environments that make this work possible.
  • Crownelius: For creating and sharing his awesome Opus reasoning dataset with the community.

📖 Citation

If you use this model in your research or projects, please cite:

@misc{teichai_gemma4_31b_opus_distilled_v2,
  title        = {Gemma-4-31B-it-Claude-Opus-Distill-v2},
  author       = {TeichAI},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2}}
}

Author: TeichAI

Likes: 6

Downloads: 0

Tags: transformers, gguf, text-generation-inference, unsloth, gemma4, reasoning, dataset:TeichAI/Claude-Opus-4.6-Reasoning-887x, dataset:TeichAI/claude-4.5-opus-high-reasoning-250x, dataset:Crownelius/Opus-4.6-Reasoning-2100x-formatted, base_model:TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2, base_model:quantized:TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2, license:apache-2.0, endpoints_compatible, region:us, conversational

LuffyTheFox/FernflowerAI-35B-A3B-KL-ReLU-GGUF


license: apache-2.0
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.5-35B-A3B
tags:

  • uncensored
  • qwen3.5
  • moe
  • gguf
  • vision
  • multimodal

language:

  • en
  • zh
  • multilingual

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

(Repaired KL + ReLU) -> FernflowerAI-KL-ReLU

Base model: HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive - 0/465 refusals.

Tensor repair by me. Method: Sig-ScaleSync-KL-ReLU

Quantization script available here: https://pastebin.com/hXhcMJn9

Repair Summary

| Criterion |
|-----------|
| C1 (saturation) |
| C2 (misalignment) |
| C3 (KL divergence) |
| C4 (ReLU asymmetry) |

| Metric | Value |
|--------|-------|
| Total weight tensors | 500 |
| Healthy | 489 |
| Repaired | 11 |
| Skipped | 233 |
| Time (pass1 / pass2) | 333.7s / 123.7s |
| Output size | 64.61 GB |
| RAM used | 2.08 GB |

Repair Statistics

| | Value |
|---|-------|
| α (min / mean / max) | 0.7120 / 1.3491 / 1.7540 |
| S before → after | 0.0006 → 0.0007 |
| KL before → after | 0.1036 → 0.0297 |
| KL reduction | 71.3% |

Repaired Tensors (11)

| Tensor | α | D | KL (before) | KL (after) |
|--------|---|---|-------------|------------|
| blk.37.ssm_conv1d.weight | 0.7721 | 0.490 | 0.0253 | 0.0162 |
| blk.36.ssm_conv1d.weight | 0.7120 | 0.485 | 0.0241 | 0.0146 |
| blk.0.ffn_up_exps.weight | 1.5711 | 0.464 | 0.3230 | 0.0839 |
| blk.0.ffn_gate_exps.weight | 1.4148 | 0.436 | 0.3134 | 0.0837 |
| blk.0.ffn_down_exps.weight | 1.1920 | 0.318 | 0.1664 | 0.0868 |
| blk.2.ssm_conv1d.weight | 1.7540 | 0.651 | 0.0744 | 0.0107 |
| blk.0.ssm_conv1d.weight | 1.7540 | 0.607 | 0.0734 | 0.0103 |
| blk.1.ssm_conv1d.weight | 1.5318 | 0.484 | 0.0500 | 0.0061 |
| blk.1.attn_qkv.weight | 1.3200 | 0.079 | 0.0469 | 0.0002 |
| blk.5.ssm_conv1d.weight | 1.3040 | 0.388 | 0.0225 | 0.0074 |
| blk.6.ssm_conv1d.weight | 1.5138 | 0.382 | 0.0200 | 0.0067 |
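For context, the KL columns above compare output distributions before and after repair. A generic KL divergence over softmaxed logits (a sketch of the metric only, not the author's Sig-ScaleSync-KL-ReLU pipeline) can be computed as:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) between the softmax distributions
    of two logit vectors; 0 when the distributions match."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```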

Usage

Ready to use. Recommended quantization: Q4_K_L or higher (Q4_K_M, Q5_K_M, Q6_K, Q8_0).
⚠️ Lower formats (Q3_K, Q2_K) break the model due to MoE + DeltaNet sensitivity.

Links:


🌟 Recommended Settings (LM Studio)

Chat template: pastebin.com/uk9ZkxCR (supports tool calling for Zed agent)

| Parameter | Value |
|-----------|-------|
| Temperature | 0.7 |
| Top K Sampling | 20 |
| Presence Penalty | 1.5 |
| Top P Sampling | 0.8 |
| Min P Sampling | 0 |
| Seed | 3407 |

System prompt: pastebin.com/pU25DVnB (solid)
Or use this minimal string as the first line:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

Then add anything you want after it. The model may underperform without this first line.

You can also extend the system prompt (pastebin.com/pU25DVnB) for your own roleplay scenarios. To do so, replace the first line:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

with:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant. Currently you are roleplaying as [your text here]


About

No changes to datasets or capabilities. Fully functional - 100% of what the original authors intended, just without refusals and with the critical architecture bug fixed on output layers.

These are meant to be the best lossless uncensored models out there.


Specs

  • 35B total parameters, ~3B active per forward pass (MoE)
  • 256 experts, 8 routed + 1 shared per token
  • Hybrid architecture: Gated DeltaNet linear attention + full softmax attention (3:1 ratio)
  • 40 layers, pattern: 10 × (3 × DeltaNet-MoE + 1 × Attention-MoE)
  • 262K native context (extendable to 1M with YaRN)
  • Natively multimodal (text, image, video)
  • Multi-token prediction (MTP) support
  • 248K vocabulary, 201 languages
  • Based on Qwen/Qwen3.5-35B-A3B

Recommended Settings (Official Qwen Authors)

Thinking mode (default):

  • General: temperature=1.0, top_p=0.95, top_k=20, min_p=0, presence_penalty=1.5
  • Coding/precise tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0

Non-thinking mode:

  • General: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5
  • Reasoning tasks: temperature=1.0, top_p=1.0, top_k=40, min_p=0, presence_penalty=2.0

Important:

  • Keep at least 128K context to preserve thinking capabilities
  • Use --jinja flag with llama.cpp for proper chat template handling
  • Vision support requires the mmproj file alongside the main GGUF

Compatibility

Works with llama.cpp, LM Studio, koboldcpp, and other GGUF-compatible runtimes.

Author: LuffyTheFox

Likes: 5

Downloads: 0

Tags: gguf, uncensored, qwen3.5, moe, vision, multimodal, image-text-to-text, conversational, en, zh, multilingual, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us