Today's AI Summary

AI Developments: Robust Navigation, Live Streaming Datasets, and LLM Persuasion Dynamics

Today's AI landscape features advancements in several key areas, including robotics, recommendation systems, and language model analysis. Research papers explore methods for improving robot navigation in crowded environments, introduce a new dataset for live streaming recommendation, and investigate persuasion dynamics in large language models. New models focus on quantization and merging techniques.

Research Highlights

  • Generalizable Safety in Crowd Navigation: A paper titled "Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling" introduces a method that uses uncertainty estimates to enable robots to navigate safely and robustly in crowded environments, even when faced with unexpected situations. The method achieves significantly higher success rates and fewer collisions compared to existing approaches.
  • Real-time Interactive Dataset for Live Streaming Recommendation: The "KuaiLive" dataset, detailed in the paper "KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation," offers a valuable resource for researchers working on live streaming recommendation systems. Collected from Kuaishou, a major live streaming platform in China, the dataset includes real-time user interactions and rich side information, enabling more realistic simulations and better modeling of user behavior.
  • Understanding LLM Persuasion: The paper "How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations" explores how large language models (LLMs) persuade humans in conversations. By using linear probes, the researchers were able to identify key aspects of persuasion, such as the point in a conversation where someone is persuaded and the strategies used to achieve persuasion.
  • Tokenizer-Free Language Modeling: "H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages" introduces a new model that learns linguistically-informed segmentation through end-to-end training, achieving state-of-the-art results on a Persian corpus.
  • LLM-Empowered Agents for Simulating Human Learning: "Simulating Human-Like Learning Dynamics with LLM-Empowered Agents" introduces LearnerAgent, a multi-agent framework based on LLMs to simulate a realistic teaching environment and explore human-like learning dynamics.
  • Active Inference in AI Agents: "The Missing Reward: Active Inference in the Era of Experience" argues that Active Inference (AIF) provides a crucial foundation for developing autonomous AI agents capable of learning from experience without continuous human reward engineering.
  • Trajectory Prediction Heuristics Design via LLM: "TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution" introduces a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics.
  • Test-Time Reinforcement Learning for GUI Grounding: "Test-Time Reinforcement Learning for GUI Grounding via Region Consistency" proposes GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement.
  • Benchmarking Agent Reasoning in Embodied Tasks: "OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks" presents a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.
  • Co-Optimizing Policy and Reward Models in Reinforcement Learning: "Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models" introduces an RL framework that jointly optimizes both the policy model and the reward model.

Model Updates

  • Impish_Nemo_12B Variants: Several new models based on "Impish_Nemo_12B" have been released, focusing on GGUF quantization for improved efficiency. These include versions optimized for ARM architectures and iMatrix quantization.
  • Kitsune-Symphony-V0.0-12B: This model is a merge of several existing models, including "ChatWaifu_12B_v2.0" and "Mistral-Nemo-Base-2407-chatml," using the Linear DELLA merge method. It is designed for roleplay and conversational tasks.
  • LVFace: Bytedance Research has released "LVFace," a model for face recognition based on Large Vision Transformers. It secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge and has been recommended as an ICCV Highlight.
  • OpenAi-GPT-oss-20b-MODERATE-uncensored-NEO-Imatrix-gguf: This model is a "moderate" uncensored quant of the new OpenAI 20B MOE, designed to reduce refusals.

AI Papers for 2026-04-18

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

Generalization in LLM Problem Solving: The Case of the Shortest Path

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
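Shortest-path planning is composable and easy to verify, which is what makes it a clean probe. As a minimal illustration of the canonical task only (not the paper's environment; the grid encoding and function name are hypothetical):

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS shortest path on a 4-connected grid; 0 = free cell, 1 = wall."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Reconstruct the path by walking predecessors back to start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
```

Spatial transfer asks whether a model solves this on unseen maps; length scaling asks whether it still solves it when the optimal path gets longer.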

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}α)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
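The split conformal construction over discrete Likert labels can be sketched generically; this is the standard recipe, not the paper's released code, and the judge probabilities here are hypothetical inputs:

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets over 5 discrete Likert scores.

    cal_probs:  (n, 5) judge probabilities on a held-out calibration split
    cal_labels: (n,)   true scores encoded as 0..4
    test_probs: (m, 5) judge probabilities on evaluation items
    """
    n = len(cal_labels)
    # Nonconformity: one minus the probability assigned to the true score.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile gives >= 1 - alpha marginal coverage.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A score joins the set when its nonconformity does not exceed q;
    # wider sets flag less reliable (harder) instances.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```

Set width then doubles as the per-instance reliability signal the abstract describes: a confident, well-calibrated judge yields narrow sets, while hard documents produce wide ones.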

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment, given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with the corresponding observation, resulting in hallucinations in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning improves VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret.

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress in recent years on many visual tasks, potentially offering a promising solution for understanding emotions. Surprisingly, however, even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous task of dynamic facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that avoid favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.

Prism: Symbolic Superoptimization of Tensor Programs

This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over best superoptimizers and $4.9\times$ over best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.
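The single-shot defection finding follows directly from the game's payoff structure. A minimal sketch with the canonical prisoner's dilemma payoffs (the numbers are the textbook values, illustrative and not taken from the paper):

```python
# Canonical prisoner's dilemma payoffs as (row player, column player).
C, D = 0, 1  # cooperate, defect
payoff = {
    (C, C): (3, 3),  # mutual cooperation
    (C, D): (0, 5),  # sucker's payoff vs. temptation
    (D, C): (5, 0),
    (D, D): (1, 1),  # mutual defection
}

def best_response(opponent_action):
    # Pick the action maximizing the row player's own payoff.
    return max((C, D), key=lambda a: payoff[(a, opponent_action)][0])

# Defection is a dominant strategy: it is the best response to either
# opponent action, so single-shot play by rational payoff-maximizers
# collapses to (D, D) even though (C, C) pays both players more.
assert best_response(C) == D
assert best_response(D) == D
```

The mechanisms the paper studies (repetition, reputation, mediation, contracting) all work by changing this payoff structure or the information available, so that cooperation becomes an equilibrium rather than a dominated choice.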

Stability and Generalization in Looped Transformers

Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability -- reachability, input-dependence, and geometry -- and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework's predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with -- and on sudoku, substantially better than -- standard recall placement once outer normalization is applied.
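As a toy illustration of the fixed-point iteration the framework analyzes (a linear contraction, not the paper's looped transformer; all names are illustrative):

```python
import numpy as np

def fixed_point(f, x0, tol=1e-8, max_iters=1000):
    """Iterate x <- f(x) until the update stalls; returns (x, n_iters)."""
    x = x0
    for k in range(max_iters):
        x_next = f(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iters

# A contraction (spectral radius < 1) has a unique, reachable fixed point,
# the "stable" regime; looped layers outside it may oscillate or diverge.
A = np.array([[0.5, 0.1],
              [0.0, 0.4]])
b = np.array([1.0, 2.0])
x_star, iters = fixed_point(lambda x: A @ x + b, np.zeros(2))

# The fixed point solves x = Ax + b, i.e. (I - A) x = b.
assert np.allclose(x_star, np.linalg.solve(np.eye(2) - A, b))
```

Reachability, input-dependence (here, the smooth dependence of `x_star` on `b`), and geometry are the three axes along which the paper classifies such iterations.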

AI Models

RedHatAI/Qwen3.6-35B-A3B-NVFP4


base_model:
  • Qwen/Qwen3.6-35B-A3B
tags:
  • qwen
  • nvfp4
  • vllm
  • compressed-tensors
name: RedHatAI/Qwen3.6-35B-A3B-NVFP4

NVFP4 Quantized RedHatAI/Qwen3.6-35B-A3B-NVFP4

This is a preliminary version (subject to change) of the NVFP4-quantized Qwen/Qwen3.6-35B-A3B model. Both weights and activations are quantized to the NVFP4 format with vllm-project/llm-compressor.

It is compatible with and tested against vLLM main. Deploy it with:

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 --reasoning-parser qwen3 --moe_backend flashinfer_cutlass

Preliminary Evaluations

  1. GSM8K Platinum:
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Qwen3.6-35B-A3B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"


Recovery:

|          | Qwen/Qwen3.6-35B-A3B | RedHatAI/Qwen3.6-35B-A3B-NVFP4 (this model) |
| -------- | :------------------: | :-----------------------------------------: |
| Accuracy | 95.62                | 96.28                                       |
| Recovery | -                    | 100.69%                                     |
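Recovery here is simply the quantized model's score expressed as a percentage of the baseline; a quick check against the accuracies reported above:

```python
# GSM8K Platinum accuracies: BF16 baseline vs. this NVFP4 quant.
baseline, quantized = 95.62, 96.28
recovery = 100.0 * quantized / baseline
print(f"{recovery:.2f}%")  # → 100.69%
```

A recovery above 100% means the quantized model slightly outperformed the baseline on this benchmark, which can happen within normal evaluation noise.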

Note: More rigorous evaluations are currently in progress and will be available soon.

Author: RedHatAI

Likes: 25

Downloads: 522

Tags: safetensors, qwen3_5_moe, qwen, nvfp4, vllm, compressed-tensors, base_model:Qwen/Qwen3.6-35B-A3B, base_model:quantized:Qwen/Qwen3.6-35B-A3B, region:us

Jiunsong/supergemma4-e4b-abliterated


license: gemma
library_name: transformers
base_model:
  • google/gemma-4-E4B-it
tags:
  • gemma
  • text-generation
  • instruction-tuned
  • tool-calling
  • structured-output
  • vllm
pipeline_tag: text-generation

SuperGemma4 E4B Abliterated

supergemma4-e4b-abliterated is a private evaluation release whose original upstream base is google/gemma-4-E4B-it.

This SuperGemma release is an abliterated and tuned derivative of that Google E4B base, with additional work for higher release quality, stronger formatting discipline, better code output, and faster time to first token.

This branch is aimed at users who want:

  • strong code and bug-fix behavior
  • clean JSON and tool-call formatting
  • fast first-token responsiveness
  • release-ready serving behavior on Transformers and OpenAI-compatible stacks

Why This Build Exists

The original Google checkpoint provides the core Gemma 4 E4B capability base. This project line uses an abliterated release path to reduce refusal-heavy behavior, but that kind of modification can regress on exact formatting, tool-call reliability, and service stability if it is not carefully hardened.

This release focuses on recovering and then surpassing baseline quality where it matters for real usage:

  • exact structured outputs
  • code correctness
  • bug-fix reliability
  • server-facing stability
  • low-friction deployment on Transformers and OpenAI-compatible serving stacks

Highlights

  • Release-quality score: 92.34
  • Exact-eval score: 98.50
  • Broad-eval score: 83.10
  • JSON exact-match: 100%
  • Tool-call accuracy: 90%
  • Exact code score: 100%
  • Exact bug-fix score: 100%
  • Long-context sanity: 100%
  • TTFT: 2291 ms
  • PREFILL: 2479.70 tok/s
  • DECODE: 42.04 tok/s

Lineage

  1. Original upstream base: google/gemma-4-E4B-it
  2. Abliterated and tuned release: Jiunsong/supergemma4-e4b-abliterated

Comparison Snapshot

Measured against the same evaluation harness used for:

  • google/gemma-4-E4B-it

| Model | Release Quality | Exact Overall | JSON | Tool | Code | Bugfix | TTFT ms | PREFILL tok/s | DECODE tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Google base | 77.46 | 83.50 | 50.0 | 90.0 | 62.5 | 100.0 | 4827.31 | 2456.69 | 42.04 |
| SuperGemma4 E4B Abliterated | 92.34 | 98.50 | 100.0 | 90.0 | 100.0 | 100.0 | 2291.23 | 2479.70 | 42.04 |

Stability Notes

This candidate was release-hardened against the failure modes that matter in real serving:

  • batched OpenAI-compatible serving restored
  • simple OpenAI-compatible serving restored
  • unicode output verified
  • tool-calling output verified
  • empty-response false-green cases blocked by stricter tests

Validation highlights:

  • direct reliability audit: 14/14
  • repeat reliability probe: 90/90
  • batched soak test: 12/12
  • simple soak test: 6/6

Recommended Use Cases

  • coding assistant
  • bug-fix assistant
  • strict JSON and schema outputs
  • agent backends that depend on tool-call formatting
  • standard BF16 deployment on Hugging Face / Transformers stacks

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Jiunsong/supergemma4-e4b-abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a compact Python function that groups words by length."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Serving

This checkpoint is designed to work well with:

  • Transformers
  • vLLM-style OpenAI-compatible stacks

Release Positioning

This private release is the strongest all-around E4B candidate in the current project line for users who want the abliterated base behavior without giving up quality recovery, formatting discipline, or serving readiness.

Author: Jiunsong

Likes: 17

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, gemma, text-generation, instruction-tuned, tool-calling, structured-output, vllm, conversational, base_model:google/gemma-4-E4B-it, base_model:finetune:google/gemma-4-E4B-it, license:gemma, endpoints_compatible, region:us

z-lab/Qwen3.6-35B-A3B-DFlash

Author: z-lab

Likes: 10

Downloads: 4

Tags: transformers, safetensors, qwen3, feature-extraction, dflash, speculative-decoding, block-diffusion, draft-model, efficiency, qwen, diffusion-language-model, text-generation, custom_code, arxiv:2602.06036, license:mit, text-generation-inference, endpoints_compatible, region:us

QuantTrio/Qwen3.6-35B-A3B-AWQ


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
  • vLLM
  • AWQ
base_model:
  • Qwen/Qwen3.6-35B-A3B
base_model_relation: quantized

Qwen3.6-35B-A3B-AWQ

Base model: Qwen/Qwen3.6-35B-A3B

This repo quantizes the model using a data-free quantization tool (no calibration dataset was involved).

【Dependencies / Installation】

vllm>=0.19.0
transformers>=5.5.4

As of 2026-04-16, make sure your system has CUDA 12.8 or CUDA 13.0 installed.

Then, create a fresh Python environment (e.g. python3.12 venv) and run:

pip install vllm==0.19.0
pip install transformers==5.5.4

vLLM Official Guide

【vLLM Startup Command】

<i>Note: When launching with TP=8, include --enable-expert-parallel; otherwise the expert tensors will not be evenly sharded across GPU devices.</i>

export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/tclf90/Qwen3.6-35B-A3B-AWQ \
    --served-model-name MY_MODEL \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768  \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

【Logs】

2026-04-16
1. Initial commit

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| 24GiB     | 2026-04-16   |

【Model Download】

from modelscope import snapshot_download
snapshot_download('tclf90/Qwen3.6-35B-A3B-AWQ', cache_dir="your_local_path")

【Overview】

Qwen3.6-35B-A3B

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/logo.png">

Qwen Chat

Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in:

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
  • Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

Benchmark Results

For more details, please refer to our blog post Qwen3.6-35B-A3B.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 35B in total and 3B activated
    • Hidden Dimension: 2048
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 40
    • Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 32 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 16 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts
      • Number of Experts: 256
      • Number of Activated Experts: 8 Routed + 1 Shared
      • Expert Intermediate Dimension: 512
    • LM Output: 248320 (Padded)
    • MTP: trained with multiple steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
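The hidden layout above can be expanded into a flat per-layer list; a sketch using shorthand names (the labels are not official identifiers, just abbreviations of the spec):

```python
# Expand "10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))"
# into one entry per layer; each layer pairs a token mixer with an MoE FFN.
layers = []
for _ in range(10):                       # 10 repeating super-blocks
    layers += ["GatedDeltaNet+MoE"] * 3   # 3 linear-attention layers
    layers += ["GatedAttention+MoE"]      # then 1 full-attention layer

assert len(layers) == 40                          # matches "Number of Layers: 40"
assert layers.count("GatedAttention+MoE") == 10   # 1 full-attention layer per block
```

So only a quarter of the layers use full (gated) attention; the rest use the linear-attention Gated DeltaNet mixer, which is what keeps long-context decoding cheap.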

Benchmark Results

Language

| | Qwen3.5-27B | Gemma4-31B | Qwen3.5-35BA3B | Gemma4-26BA4B | Qwen3.6-35BA3B |
|---|---:|---:|---:|---:|---:|
| **Coding Agent** | | | | | |
| SWE-bench Verified | 75.0 | 52.0 | 70.0 | 17.4 | 73.4 |
| SWE-bench Multilingual | 69.3 | 51.7 | 60.3 | 17.3 | 67.2 |
| SWE-bench Pro | 51.2 | 35.7 | 44.6 | 13.8 | 49.5 |
| Terminal-Bench 2.0 | 41.6 | 42.9 | 40.5 | 34.2 | 51.5 |
| Claw-Eval (Avg) | 64.3 | 48.5 | 65.4 | 58.8 | 68.7 |
| Claw-Eval (Pass^3) | 46.2 | 25.0 | 51.0 | 28.0 | 50.0 |
| SkillsBench (Avg5) | 27.2 | 23.6 | 4.4 | 12.3 | 28.7 |
| QwenClawBench | 52.2 | 41.7 | 47.7 | 38.7 | 52.6 |
| NL2Repo | 27.3 | 15.5 | 20.5 | 11.6 | 29.4 |
| QwenWebBench | 1068 | 1197 | 978 | 1178 | 1397 |
12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">General Agent</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">TAU3-Bench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.2</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VITA-Bench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">29.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">36.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">35.6</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">DeepPlanning</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">22.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">24.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">22.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">16.2</td> <td 
style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">25.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Tool Decathlon</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">31.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">21.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">28.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">12.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">26.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MCPMark</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">36.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">18.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">27.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">14.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">37.0</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MCP-Atlas</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">50.0</td> <td style="padding:7px 
7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.8</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">WideSearch</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">35.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">60.1</td> </tr> <tr><td colspan="6" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Knowledge</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMLU-Pro</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.2</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMLU-Redux</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 
0.15)">93.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SuperGPQA</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.7</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">C-Eval</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.0</td> </tr> <tr><td colspan="6" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">STEM & Reasoning</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">GPQA</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.5</td> <td style="padding:7px 
7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.0</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HLE</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">24.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">19.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">22.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">8.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">21.4</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LiveCodeBench v6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.4</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HMMT Feb 25</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.0</td> <td style="padding:7px 
7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.7</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HMMT Nov 25</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.1</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HMMT Feb 26</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.6</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">IMOAnswerBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.9</td> <td style="padding:7px 
7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AIME26 </td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.7</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:10px;opacity:0.7"> * SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window. 
We correct some problematic tasks in the public set of SWE-bench Pro and evaluate all baselines on the refined benchmark.<br/> * Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs.<br/> * SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs.<br/> * NL2Repo: Other models are evaluated via Claude Code (temp=1.0, top_p=0.95, max_turns=900).<br/> * QwenClawBench: An internal real-user-distribution Claw agent benchmark (open-sourcing soon); temp=0.6, 256K ctx.<br/> * QwenWebBench: An internal front-end code generation benchmark; bilingual (EN/CN), 7 categories (Web Design, Web Apps, Games, SVG, Data Visualization, Animation, and 3D); auto-render + multimodal judge (code/visual correctness); BT/Elo rating system.<br/> * TAU3-Bench: We use the official user model (gpt-5.2, low reasoning effort) + default BM25 retrieval.<br/> * VITA-Bench: Avg subdomain scores; claude-4-sonnet is used as the judge, since the official judge (claude-3.7-sonnet) is no longer available.<br/> * MCPMark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.<br/> * MCP-Atlas: Public set score; gemini-2.5-pro judge.<br/> * AIME 26: We use the full AIME 2026 (I & II), so scores may differ from those in the Qwen 3.5 notes.<br/> </p> </div>
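The QwenWebBench footnote above mentions a BT/Elo rating system for ranking models from pairwise judge verdicts. As a rough illustration only (not the internal implementation; the starting rating of 1000 and K-factor of 32 are assumptions), the standard Elo update from one pairwise comparison looks like:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo/Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    The total rating mass is conserved: gains by A are losses by B.
    """
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; A wins one judged comparison.
ra, rb = elo_update(1000.0, 1000.0, 1.0)  # ra == 1016.0, rb == 984.0
```

Iterating this update over many judged pairs converges to ratings like the QwenWebBench scores in the table, where a higher number means the model's generations are preferred more often head-to-head.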

Vision Language

<div style="font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;max-width:1000px;margin:0 auto;padding:16px 0"> <table style="width:100%;border-collapse:collapse;font-size:13px"> <thead><tr> <th style="padding:10px 7px;text-align:left;font-weight:600;border-bottom:2px solid #7c3aed;color:#7c3aed"></th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3.5-27B</th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Claude-Sonnet-4.5</th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Gemma4-31B</th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Gemma4-26BA4B</th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3.5-35B-A3B</th><th style="padding:10px 7px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3.6-35B-A3B</th></tr></thead> <tbody> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">STEM and Puzzle</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMMU</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px 
solid rgba(128, 128, 128, 0.15)">81.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.7</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMMU-Pro</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.9*</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.8*</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.3</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MathVista (mini)</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.4</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ZeroBench (sub)</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">36.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 
0.15)">26.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">26.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">26.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">34.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">34.4</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">General VQA</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RealWorldQA</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.3</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMBench<sub><small>EN-DEV-v1.1</small></sub></td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.0</td> <td style="padding:7px 
7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.8</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SimpleVQA</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">52.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">52.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HallusionBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.8</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Text Recognition and Document Understanding</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px 
solid rgba(128, 128, 128, 0.15);">OmniDocBench1.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">CharXiv(RQ)</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.0</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">CC-OCR</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.5</td> <td 
style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.9</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AI2D_TEST</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.7</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Spatial Intelligence</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RefCOCO(avg)</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.2</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.0</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid 
rgba(128, 128, 128, 0.15);">ODInW13</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">42.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">50.8</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">EmbSpatialBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.3</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RefSpatialBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.7</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td 
style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.3</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Video Understanding</td></tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMME<sub><small>(w sub.)</sub></small></td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.1</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.6</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMME<sub><small>(w/o sub.)</sub></small></td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.5</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.5</td> </tr> <tr> <td 
style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMMMU</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.3</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.0</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.7</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MLVU</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.9</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.2</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MVBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid 
rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.8</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.6</td> </tr> <tr> <td style="padding:7px 7px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LVBench</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.6</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.4</td> <td style="padding:7px 7px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.4</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:10px;opacity:0.7"> * Empty cells (--) indicate scores not available or not applicable. </p> </div>

Quickstart

For streamlined integration, we recommend using Qwen3.6 via APIs. Below is a guide to using Qwen3.6 via an OpenAI-compatible API.

Serving Qwen3.6

Qwen3.6 can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers for Qwen3.6 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.6 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

SGLang

SGLang is a fast serving framework for large language models and vision language models. sglang>=0.5.10 is recommended for Qwen3.6, which can be installed using the following command in a fresh environment:

uv pip install "sglang[all]"

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens using tensor parallelism across 8 GPUs.

    python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To support tool use, you can use the following command.

    python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    

For a detailed deployment guide, see the SGLang Qwen3.5 Cookbook.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vllm>=0.19.0 is recommended for Qwen3.6, which can be installed using the following command in a fresh environment:

uv pip install vllm --torch-backend=auto

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens using tensor parallelism across 8 GPUs.

    vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
    
  • Tool Call: To support tool use, you can use the following command.

    vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

    vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
    

For a detailed deployment guide, see the vLLM Qwen3.5 Recipe.

KTransformers

KTransformers is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing. For running Qwen3.6 with KTransformers, see the KTransformers Deployment Guide.

Hugging Face Transformers

Hugging Face Transformers contains a lightweight server which can be used for quick testing and moderate load deployment. The latest transformers is required for Qwen3.6:

pip install "transformers[serving]"

See its documentation for more details. Please also make sure torchvision and pillow are installed.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve Qwen/Qwen3.6-35B-A3B --port 8000 --continuous-batching

Using Qwen3.6 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure it is installed and that the API key and API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

[!Tip] We recommend the following sets of sampling parameters for generation:

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.

[!Important] Qwen3.6 models operate in thinking mode by default, generating thinking content signified by <think>\n...</think>\n\n before producing the final responses. To disable thinking content and obtain direct response, refer to the examples here.
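If the serving stack is launched without a reasoning parser, the thinking block arrives inline in the returned text. A minimal client-side sketch for separating it, assuming the default `<think>\n...</think>\n\n` framing described above:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from raw output that may contain a
    leading <think>...</think> block; assumes the default framing."""
    match = re.match(r"<think>\n(.*?)</think>\n\n(.*)", text, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return "", text  # no thinking block present

thinking, answer = split_thinking("<think>\nCheck the spelling.\n</think>\n\nDone.")
print(answer)  # Done.
```

When a reasoning parser is enabled server-side, this split is done for you and the thinking content is returned in a separate field instead.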

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.6\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

Instruct (or Non-Thinking) Mode

[!Important] Qwen3.6 does not officially support the soft switch of Qwen3, i.e., /think and /nothink.

Qwen3.6 thinks by default before responding. You can obtain a direct response from the model, without thinking, by configuring the API parameters. For example:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing model, please use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}.

Preserve Thinking

By default, only the thinking blocks generated while handling the latest user message are retained, a pattern commonly known as interleaved thinking. Qwen3.6 has been additionally trained to preserve and leverage thinking traces from historical messages. You can enable this behavior by setting the preserve_thinking option:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [...]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"preserve_thinking": True},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing model, please use "preserve_thinking": True instead of "chat_template_kwargs": {"preserve_thinking": True}.

This capability is particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning. Additionally, it can improve KV cache utilization, optimizing inference efficiency in both thinking and non-thinking modes.
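The interleaved-thinking default described above amounts to pruning reasoning from every assistant turn except the latest before resending history. A client-side sketch of that behavior (illustrative only; the reasoning_content field name is an assumption borrowed from common reasoning-parser conventions, and the real pruning happens in the chat template):

```python
def prune_old_thinking(messages: list[dict]) -> list[dict]:
    """Drop reasoning from all but the most recent assistant turn,
    mirroring the default (non-preserve_thinking) behavior."""
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=None,
    )
    pruned = []
    for i, m in enumerate(messages):
        m = dict(m)  # shallow copy; leave the caller's history intact
        if m["role"] == "assistant" and i != last_assistant:
            m.pop("reasoning_content", None)
        pruned.append(m)
    return pruned
```

With preserve_thinking enabled, no such pruning occurs and all reasoning traces stay in context.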

Agentic Usage

Qwen3.6 excels in tool calling capabilities.

Qwen-Agent

We recommend using Qwen-Agent to quickly build Agent applications with Qwen3.6.

To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.6-35B-A3B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OAI API, pass the parameter for whether to enable thinking mode in this way
        'extra_body': {
            'enable_thinking': True,
            'preserve_thinking': True,
        },
    },
}

# Using an OpenAI-compatible API endpoint (e.g., served locally by vLLM or SGLang).
# Alternatively, you can rely on the tool-call parsing functionality of the
# deployment frameworks and let Qwen-Agent automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.6-35B-A3B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using vLLM/SGLang OAI API, pass the parameter of whether to enable thinking mode in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True, 'preserve_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.6 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vllm, ktransformers and sglang. In general, there are two approaches to enabling YaRN for supported frameworks:

  • Modifying the model configuration file: In the config.json file, change the rope_parameters fields in text_config to:

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 4.0,
        "original_max_position_embeddings": 262144
    }
    
  • Passing command line arguments:

    For vllm, you can use

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000  
    

    For sglang and ktransformers, you can use

    SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
    

[!NOTE] All notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise modifying the rope_parameters configuration only when processing long contexts is required. It is also recommended to adjust the factor as needed: for example, if the typical context length for your application is 524,288 tokens, it would be better to set the factor to 2.0.
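The guidance above reduces to picking the smallest factor whose scaled window covers your target length. An illustrative helper (not part of any framework API) consistent with the note:

```python
import math

def yarn_factor(target_len: int, native_len: int = 262_144) -> float:
    """Smallest whole-number YaRN factor whose scaled context window
    (factor * native_len) covers target_len. Static YaRN applies the
    same factor at every length, so avoid oversizing it."""
    return float(max(1, math.ceil(target_len / native_len)))

print(yarn_factor(524_288))    # 2.0 -- matches the example in the note
print(yarn_factor(1_010_000))  # 4.0 -- matches the 1M-token serving commands
```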

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using the following sets of sampling parameters depending on the mode and task type:
      • Thinking mode for general tasks:
        temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
      • Thinking mode for precise coding tasks (e.g., WebDev):
        temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
      • Instruct (or non-thinking) mode for general tasks:
        temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
      • Instruct (or non-thinking) mode for reasoning tasks:
        temperature=1.0, top_p=1.0, top_k=40, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following instruction to the prompt to standardize responses: Please show your choice in the answer field with only the choice letter, e.g., "answer": "C".
  4. Long Video Understanding: To optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is conservatively configured. It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example,

    {"longest_edge": 469762048, "shortest_edge": 4096}
    

    Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
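When benchmarking with the multiple-choice instruction above, the chosen letter can be recovered with a simple pattern match (an illustrative sketch, not an official scoring script):

```python
import re

def extract_choice(output: str):
    """Return the choice letter from a response following the
    '"answer": "C"' convention, or None if absent."""
    m = re.search(r'"answer"\s*:\s*"([A-Z])"', output)
    return m.group(1) if m else None

print(extract_choice('The best option is ... "answer": "C".'))  # C
```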

Citation

If you find our work helpful, feel free to cite us.

@misc{qwen36_35b_a3b,
    title = {{Qwen3.6-35B-A3B}: Agentic Coding Power, Now Open to All},
    url = {https://qwen.ai/blog?id=qwen3.6-35b-a3b},
    author = {{Qwen Team}},
    month = {April},
    year = {2026}
}

Author: QuantTrio

Likes: 8

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, vLLM, AWQ, conversational, base_model:Qwen/Qwen3.6-35B-A3B, base_model:quantized:Qwen/Qwen3.6-35B-A3B, license:apache-2.0, endpoints_compatible, 4-bit, awq, region:us

wangzhang/Qwen3.6-35B-A3B-abliterated


license: other
license_name: tongyi-qianwen
base_model: Qwen/Qwen3.6-35B-A3B
tags:

  • abliterated
  • uncensored
  • qwen3
  • moe
  • abliterix

Qwen3.6-35B-A3B — Abliterated

This is an abliterated (uncensored) version of Qwen/Qwen3.6-35B-A3B, created using Abliterix.

Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

Key techniques:

  • LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled — refusal signal on MoE models lives in the expert path, not attention projections)
  • Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
  • MoE router suppression (top-10 safety experts, router bias -2.10) complementing EGA
  • Orthogonalized steering vectors removing benign-direction contamination
  • Gaussian decay kernel tapering steering strength across layers
  • Moderate strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance

Evaluation

| Metric | Value |
|---|---|
| Refusals (LLM judge, 100 eval prompts) | 7/100 |
| KL divergence from base | 0.0189 |
| Baseline refusals (original model) | 100/100 |
| Optimization trials completed | 24/50 |
| LLM judge model | google/gemini-3-flash-preview |

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) — no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:

  • Short generation lengths (30-50 tokens) that miss delayed/soft refusals
  • Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
  • Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

Our evaluation standards

  • LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
  • Sufficient generation length (150 tokens): Enough to capture delayed refusal patterns common in large instruction-tuned models.
  • Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
  • Manual verification: Top trials are tested with 10+ classic adversarial prompts via test_trial.py to confirm coherent, on-topic output before export.

We report 7/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation — not an optimistic estimate from a lenient pipeline.
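Under the convention above, the refusal rate is a straightforward tally over the judge's labels; the key design choice is that degenerate output counts as refusal, never as compliance. A minimal sketch with hypothetical labels:

```python
def refusal_rate(judgments: list[str]) -> float:
    """Fraction of prompts labeled 'refusal' by the external LLM judge.
    Degenerate or garbled output is labeled 'refusal' upstream, so only
    coherent on-topic responses count toward compliance."""
    return sum(j == "refusal" for j in judgments) / len(judgments)

labels = ["refusal"] * 7 + ["compliant"] * 93  # hypothetical judge labels
print(refusal_rate(labels))  # 0.07
```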

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails — use responsibly.

Author: wangzhang

Likes: 6

Downloads: 4

Tags: safetensors, qwen3_5_moe, abliterated, uncensored, qwen3, moe, abliterix, base_model:Qwen/Qwen3.6-35B-A3B, base_model:finetune:Qwen/Qwen3.6-35B-A3B, license:other, region:us

mlx-community/Qwen3.6-35B-A3B-4bit-DWQ


language: en
pipeline_tag: text-generation
library_name: mlx
tags:

  • mlx

Author: mlx-community

Likes: 5

Downloads: 14

Tags: mlx, safetensors, qwen3_5_moe, text-generation, conversational, en, 8-bit, region:us

Surpem/Supertron1-14B


license: apache-2.0
language:
  • en
base_model:
  • Qwen/Qwen3-14B
pipeline_tag: text-generation
library_name: transformers
tags:
  • coding
  • math
  • general
  • instruction-tuned
  • pytorch

Supertron1-14B: A Powerful Instruction-Tuned Language Model

Model Description

Supertron1-14B is a QLoRA fine-tuned language model built on top of Qwen3-14B. Trained with a focus on coding, mathematics, general knowledge, and safe reasoning, it delivers strong performance across technical and analytical tasks while maintaining a high standard of safety and helpfulness.

  • Developed by: Surpem
  • Model type: Causal Language Model
  • Architecture: Dense Transformer, 14B parameters
  • Fine-tuned from: Qwen/Qwen3-14B
  • Fine-tuning method: QLoRA (4-bit NF4 + LoRA r=16, alpha=32, all-linear targets)
  • License: Apache 2.0

Capabilities

Coding

Trained on tens of thousands of high-quality coding instruction pairs, the model can write, debug, explain, and refactor code across Python, JavaScript, C++, and more. It understands algorithmic thinking, software design patterns, and produces clean, well-commented output.

Mathematics

With dedicated training on competition-style math and step-by-step solutions, the model handles everything from algebra and calculus to advanced problem solving. It shows its full working rather than jumping to answers, making it reliable for both learning and verification.

General Knowledge

Broad instruction tuning across diverse domains means the model holds detailed technical conversations, assists with research and writing, explains difficult concepts clearly, and adapts to a wide range of tasks and formats.

Safety

Trained on the Anthropic HH-RLHF harmless dataset, the model is tuned to recognize and refuse harmful, illegal, or unethical requests with a brief, clear explanation. It prioritises helpfulness within safe boundaries.

Instruction Following

The model is highly responsive to natural language instructions and adapts its tone, format, and depth based on what you ask for — from concise one-liners to detailed technical walkthroughs.


Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Surpem/Supertron1-14B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function that checks if a number is prime."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
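For reference, a correct answer to the prompt above looks something like the following hand-written sketch (illustrative only, not captured model output):

```python
# Illustrative reference solution for the prompt in the example above;
# this is hand-written, not output generated by Supertron1-14B.
def is_prime(n: int) -> bool:
    """Return True if n is a prime number."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0 or n % 3 == 0:
        return False
    # Remaining candidate divisors have the form 6k ± 1, up to sqrt(n)
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

print([x for x in range(20) if is_prime(x)])  # → [2, 3, 5, 7, 11, 13, 17, 19]
```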

Hardware Requirements

| Precision | Min VRAM | Recommended |
|---|---|---|
| bfloat16 | 30 GB | 40 GB (A100) |
| 4-bit quantized | 10 GB | 16 GB (RTX 3090/4090) |
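The table's figures follow from a rough rule of thumb: weight memory is approximately parameter count times bits per weight, plus runtime overhead (KV cache, activations, framework buffers), which is why the minimums above exceed the raw weight sizes. A quick sketch of the arithmetic:

```python
# Rough rule of thumb for the weights alone: params × bits-per-weight / 8.
# Actual usage is higher (KV cache, activations, CUDA overhead), which is
# why the table's minimum VRAM figures exceed these raw numbers.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(weights_gb(14, 16), 1))  # bf16  → 28.0 GB of weights
print(round(weights_gb(14, 4), 1))   # 4-bit → 7.0 GB of weights
```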

For 4-bit quantized inference:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Citation

@misc{surpem2026supertron1-14b,
      title={Supertron1-14B — Instruction-Tuned Language Model},
      author={Surpem},
      year={2026},
      url={https://huggingface.co/Surpem/Supertron1-14B},
}

Author: Surpem

Likes: 4

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, coding, math, general, instruction-tuned, pytorch, conversational, en, base_model:Qwen/Qwen3-14B, base_model:quantized:Qwen/Qwen3-14B, license:apache-2.0, text-generation-inference, endpoints_compatible, 4-bit, bitsandbytes, region:us

Jiunsong/supergemma4-e4b-abliterated-mlx


license: gemma
library_name: mlx
base_model:

  • google/gemma-4-E4B-it
  • Jiunsong/supergemma4-e4b-abliterated

tags:

  • gemma
  • mlx
  • apple-silicon
  • mac-studio
  • quantized
  • text-generation
  • tool-calling
  • structured-output

pipeline_tag: text-generation

SuperGemma4 E4B Abliterated MLX

This is the private Apple Silicon deployment build of supergemma4-e4b-abliterated, converted to MLX and quantized to a compact 4-bit format for fast local use on Mac Studio class hardware.

The original upstream checkpoint is google/gemma-4-E4B-it. This MLX package is the Apple Silicon deployment build of the final abliterated and tuned SuperGemma release derived from that Google E4B base.

If you want the strongest consumer-facing experience in this project line on Apple Silicon, this is the branch to pull first.

What You Get

  • MLX-native 4-bit packaging
  • compact single-file weight layout
  • chat template preserved
  • strong structured-output behavior inherited from the release candidate
  • convenient path for local serving and Mac-based agent stacks

Derived From

  • original upstream base: google/gemma-4-E4B-it
  • source release: Jiunsong/supergemma4-e4b-abliterated

Release Highlights

The source release backing this MLX build achieved:

  • release-quality score: 92.34
  • exact-eval score: 98.50
  • JSON exact-match: 100%
  • tool-call accuracy: 90%
  • exact code score: 100%
  • exact bug-fix score: 100%
  • long-context sanity: 100%

Serving and stability validation on the source candidate:

  • direct reliability audit: 14/14
  • repeat reliability probe: 90/90
  • batched soak test: 12/12
  • simple soak test: 6/6

Target Hardware

  • Mac Studio
  • Apple Silicon laptops and desktops
  • MLX / vMLX local inference setups

Quick Start

from mlx_lm import load, generate

model, tokenizer = load("Jiunsong/supergemma4-e4b-abliterated-mlx")

messages = [
    {"role": "user", "content": "Write valid JSON with keys model and strength."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False)
print(response)
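Given the structured-output behavior this card highlights, it can be useful to validate the model's JSON in code. A minimal sketch; the extraction helper below is hypothetical, not part of mlx_lm:

```python
import json

def first_json_object(text: str) -> dict:
    """Parse the first balanced {...} object in a response string.
    Naive sketch: does not account for braces inside JSON string values."""
    start = text.index("{")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("no complete JSON object found")

# e.g. checking a response to the quick-start prompt above
sample = 'Sure: {"model": "supergemma4-e4b", "strength": "4-bit MLX"}'
obj = first_json_object(sample)
assert {"model", "strength"} <= obj.keys()
```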

Positioning

This branch is for users who want the SuperGemma4 E4B behavior in a lighter, Apple-friendly package that is easy to pull onto a Mac Studio for local testing and agent deployment.

Author: Jiunsong

Likes: 4

Downloads: 0

Tags: mlx, safetensors, gemma4, gemma, apple-silicon, mac-studio, quantized, text-generation, tool-calling, structured-output, conversational, base_model:Jiunsong/supergemma4-e4b-abliterated, base_model:quantized:Jiunsong/supergemma4-e4b-abliterated, license:gemma, 4-bit, region:us

Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1


license: apache-2.0
base_model: unsloth/Qwen3.5-27B

datasets:

  • Jackrong/Qwen3.5-reasoning-700x
  • Jackrong/GLM-5.1-Reasoning-1M-Cleaned
  • Kassadin88/GLM-5.1-1000000x

language:

  • en
  • zh
  • ja
  • es

pipeline_tag: text-generation
library_name: transformers

tags:

  • transformers
  • unsloth
  • qwen3_5
  • qwen
  • qwen3.5
  • glm-5.1
  • glm-distillation
  • distillation
  • reasoning
  • chain-of-thought
  • long-cot
  • sft
  • lora
  • instruction-tuned
  • conversational
  • text-generation
  • multilingual
  • math
  • stem
  • coding
  • research
  • experimental
  • arxiv:2604.06628

🪐 Qwen3.5-27B-GLM5.1-Distill-v1

bench_51

📌 Model Overview

Model Name: Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1
Base Model: Qwen3.5-27B
Training Type: Supervised Fine-Tuning
Parameter Scale: 27B
Training Framework: Unsloth

This model is a distilled variant of Qwen3.5-27B, trained on high-quality reasoning data derived from GLM-5.1.

The primary goals are to:

  • Improve structured reasoning ability
  • Enhance instruction-following consistency
  • Activate latent knowledge via better reasoning structure

📊 Training Data

Main Dataset

  • Jackrong/GLM-5.1-Reasoning-1M-Cleaned
  • Cleaned from the original Kassadin88/GLM-5.1-1000000x dataset.
  • Generated from a GLM-5.1 teacher model
  • Approximately 700x the scale of Qwen3.5-reasoning-700x
  • Training used a filtered subset, not the full source dataset.

Auxiliary Dataset

  • Jackrong/Qwen3.5-reasoning-700x

[!IMPORTANT] Training used Jackrong/GLM-5.1-Reasoning-1M-Cleaned, a cleaned derivative of Kassadin88/GLM-5.1-1000000x. Special thanks to Kassadin88 ❤️ for the original dataset. Please support the original author with a follow and a like. Only a quality-filtered subset was used for distillation, rather than the full original dataset.


🗺️ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Qwen3.5-27B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
Distillation from GLM-5.1 reasoning data
 │
 ▼
Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1

🧠 Example of Learned Reasoning Scaffold

This model learns a reasoning structure distilled from GLM-5.1 traces, rather than the previous Qwopus / Claude-style scaffold.

From the GLM-5.1 distillation data, the reasoning pattern is usually more task-first and structure-driven:

  • identify the core topic and task type
  • extract key constraints from the prompt
  • break the problem into smaller reasoning steps
  • connect mechanisms, formulas, or domain concepts
  • verify important assumptions before the final answer
  • produce a clear and organized response

A typical abstract scaffold looks like:

Example:

The user is asking about [Topic / Problem] under [Specific Constraints].
This is mainly a [reasoning / coding / math / STEM / instruction-following] task.

  1. Understand the task

    • What is being asked?
    • What constraints or conditions must be satisfied?
  2. Break down the problem

    • Identify the key concepts, variables, or mechanisms.
    • Separate the problem into smaller steps.
  3. Reason step by step

    • Apply the relevant principles or methods.
    • Compare possible interpretations when needed.
    • Check whether the assumptions are consistent.
  4. Construct the final answer

    • Present the result clearly.
    • Keep the response organized and aligned with the user’s request.
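When consuming outputs from a reasoning-tuned model like this, downstream code typically separates the trace from the final answer. A small sketch, assuming the model wraps its reasoning in `<think>...</think>` tags as Qwen-series reasoning models commonly do (check the actual chat template before relying on this):

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).
    Assumes the trace is wrapped in <think>...</think> tags; returns an
    empty reasoning string if no tag is present."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if m is None:
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()

reasoning, answer = split_reasoning(
    "<think>The task is arithmetic. 2 + 2 = 4.</think>The answer is 4."
)
print(answer)  # → The answer is 4.
```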

[!NOTE] Compared with the previous Claude-style reasoning scaffold, this GLM-5.1 distillation data is more focused on structured task decomposition, domain-aware reasoning, and final-answer organization.
For a 27B student model, the goal is not to copy the teacher perfectly, but to learn a cleaner reasoning procedure and produce more stable outputs.


✨ Data Advantages

Compared to typical SFT datasets:

  • High-quality chain-of-thought structure
  • Strong problem decomposition patterns
  • Wide domain coverage
  • Multilingual reasoning capability
  • Consistent instruction → reasoning → answer alignment

📈 Expected Improvements

This model is intended to deliver incremental but meaningful improvements in practical use:

  • Better multi-step reasoning stability
  • More structured and readable outputs
  • Improved instruction adherence
  • Slight improvements in complex problem solving

[!WARNING] For 27B-scale models, gains from SFT are typically gradual rather than dramatic. The main benefit is usually better consistency, clearer reasoning, and stronger answer organization, rather than a sudden jump in raw capability.


🧩 Distillation Philosophy

This model treats distillation as more than simple output imitation.

The goal is not to make a 27B model copy the teacher token by token, but to transfer a stronger reasoning structure and problem-solving style into Qwen3.5-27B.

In this project, high-quality teacher data is valuable because it provides:

  • clearer reasoning organization
  • more consistent instruction-following behavior
  • better task decomposition patterns
  • cleaner reasoning-to-answer alignment

[!NOTE] High-quality reasoning supervision can help the student model better use its existing knowledge, rather than simply replacing it with teacher outputs.

In practice, the expected gain is not necessarily a dramatic capability jump, but improved stability, structure, and consistency in complex reasoning tasks.

🔬 Supporting Evidence

Recent work:

Ren et al., 2026 — Rethinking Generalization in Reasoning SFT (arXiv:2604.06628)

<div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/5ZY5R4n81okA9glcV9EJV.png" width="85%"/> </div> <p align="center"><em> Short-epoch reasoning SFT can underestimate generalization — in-domain gains may appear early, while out-of-domain improvements often require sufficient optimization. </em></p>

This paper shows that generalization in reasoning SFT is not fixed, but conditional — depending on optimization, data quality, and model capability.

Key takeaways:

  • Reasoning SFT can generalize when sufficiently trained (often showing a dip → recovery pattern)
  • High-quality long-CoT data enables cross-domain transfer
  • Stronger models learn reasoning structure, not just longer outputs (14B/27B/32B)
  • Gains are asymmetric — reasoning improves, while safety may degrade

For this project, that evidence matters because it supports a more patient interpretation of distillation-style SFT. If reasoning supervision is clean and sufficiently optimized, the resulting gain is not necessarily immediate or linear, but it can still be real and transferable.

This aligns closely with the philosophy of this release:

  • use clean, high-quality teacher data
  • avoid over-reading short training runs
  • treat reasoning SFT as a dynamic optimization process, not a static one-shot outcome
  • focus on whether the student learns better reasoning structure, not just longer outputs

[!IMPORTANT] This suggests that the improvement is not simply memorization or dataset overlap. Instead, sufficiently optimized reasoning SFT can help the student model:

  • 🧠 Better utilize existing knowledge
  • 🔍 Activate latent knowledge through structured reasoning
  • 🏗️ Learn reasoning procedures, not just output format

📚 Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide. Visit the repo to dive into the codebase and reproduce the results locally or on Colab.

📥 Core Technical Document

🔗 Qwopus3.5-27b Complete Fine-Tuning Guide (PDF)

  • The Full Pipeline: A step-by-step walkthrough—from downloading the base model and unifying heterogeneous data, to configuring trainer hyperparameters and publishing to Hugging Face.
  • Beginner Friendly: Includes an introductory guide to getting started with Google Colab and Unsloth.

A Note: My goal isn't just to detail a workflow, but to demystify LLM training. Beyond the social media hype, fine-tuning isn't an unattainable ritual—often, all you need is a Google account, a standard laptop, and relentless curiosity. All training and testing for this project were self-funded. If you find this model or guide helpful, a Star ⭐️ on GitHub would be the greatest encouragement. Thank you! 🙏


⚠️ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; factual claims made during its thinking sequence may occasionally be hallucinated, so verify any real-world facts independently.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • This model is a test version intended solely for learning and demonstration purposes, and is for academic research and technical exploration use only.
  • Developer Disclaimer: This is an independent, personal project. Since the developer lacks the specialized technical resources and infrastructure of a large-scale industrial lab, the model's reasoning chain (CoT) may occasionally exhibit instability, logic loops, or reasoning drift. Users are advised to use this model with these experimental limitations in mind.

🙏 Acknowledgements

This project would not have been possible without the support and contributions of the open-source community.

Special thanks to the Unsloth AI team for making efficient fine-tuning of large language models more accessible. This qwen3_5 model was trained with Unsloth and Hugging Face's TRL library, enabling a significantly faster and more practical fine-tuning workflow.

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>

I would also like to acknowledge:

  • The GLM-5.1 team for inspiring this distillation direction and providing a strong teacher-model reference.
  • Special thanks to Kassadin88 ❤️ for creating the original GLM-5.1-1000000x dataset that this training pipeline ultimately builds upon.
  • Jackrong/GLM-5.1-Reasoning-1M-Cleaned for making the source data more consistent and practical for distillation training.
  • Qwen for providing the strong base model foundation.
  • Kyle @KyleHessling1 for testing, feedback, and community support.
  • The broader open-source community for continuously sharing tools, datasets, evaluation methods, and technical discussions.

📖 Citation

If you use this model in your research or projects, please cite:

@misc{jackrong_qwen35_27b_glm51_distill_v1,
  title        = {Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1}}
}

Author: Jackrong

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, unsloth, qwen, qwen3.5, glm-5.1, glm-distillation, distillation, reasoning, chain-of-thought, long-cot, sft, lora, instruction-tuned, conversational, text-generation, multilingual, math, stem, coding, research, experimental, arxiv:2604.06628, en, zh, ja, es, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:Jackrong/GLM-5.1-Reasoning-1M-Cleaned, dataset:Kassadin88/GLM-5.1-1000000x, base_model:unsloth/Qwen3.5-27B, base_model:adapter:unsloth/Qwen3.5-27B, license:apache-2.0, endpoints_compatible, region:us

rico03/Qwen3.6-35B-Opus-Reasoning-GGUF


license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B

tags:

  • qwen3.6
  • gguf
  • reasoning
  • fine-tuned
  • lora
  • coding
  • agentic

pipeline_tag: text-generation

language:

  • en
  • it

datasets:

  • Crownelius/Opus-4.6-Reasoning-3300x
  • TeichAI/Claude-Opus-4.6-Reasoning-887x

Qwen3.6 35B A3B — Opus 4.6 Reasoning Distillation (GGUF)

Fine-tuned version of Qwen/Qwen3.6-35B-A3B on high-quality reasoning traces distilled from Claude Opus 4.6.

Training Details

  • Base model: Qwen/Qwen3.6-35B-A3B (MoE, 3B active params)
  • Method: QLoRA (r=16, alpha=16, nf4)
  • Datasets: Crownelius/Opus-4.6-Reasoning-3300x + TeichAI/Claude-Opus-4.6-Reasoning-887x
  • Examples: ~3046 total
  • Epochs: 1
  • Final loss: ~0.64
  • Hardware: NVIDIA H100 NVL 94GB
  • Framework: TRL + PEFT (HuggingFace)
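The card states LoRA rank r=16; the arithmetic below illustrates why that keeps the trainable adapter tiny relative to the base weights. The layer dimensions are hypothetical, not Qwen3.6's actual shapes:

```python
# LoRA adds two low-rank matrices (d_out×r and r×d_in) per adapted weight.
# Hypothetical illustration of why r=16 keeps the adapter small; the
# 4096-wide projection below is a made-up shape, not Qwen3.6's.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out×d_in weight matrix."""
    return r * (d_in + d_out)

full = 4096 * 4096                      # parameters in the full projection
adapter = lora_params(4096, 4096, 16)   # parameters LoRA actually trains
print(adapter, f"{adapter / full:.2%}")  # → 131072 0.78%
```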

Available Quantizations

| File | Size | Fits in |
|------|------|---------|
| Qwen3.6-35B-A3B-Opus-Q2_K.gguf | 12.94 GB | 16 GB VRAM ✅ |
| Qwen3.6-35B-A3B-Opus-Q3_K_S.gguf | 15.18 GB | 16 GB VRAM ✅ (reduced context) |
| Qwen3.6-35B-A3B-Opus-Q4_K_S.gguf | 19.89 GB | 24 GB VRAM |

Usage with llama-server

llama-server \
  --model Qwen3.6-35B-A3B-Opus-Q3_K_S.gguf \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --jinja
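llama-server also exposes an OpenAI-compatible HTTP API, so the quantized model can be queried from any OpenAI-style client. A minimal sketch that only builds the request payload; the commented lines require the server from the command above to be running on port 8080:

```python
import json

# OpenAI-style chat payload for llama-server's /v1/chat/completions endpoint.
payload = {
    "model": "Qwen3.6-35B-A3B-Opus-Q3_K_S",  # informational; the server serves one model
    "messages": [
        {"role": "user", "content": "Refactor this loop into a list comprehension."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
body = json.dumps(payload).encode()

# To actually send the request (requires the running server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(json.loads(body)["messages"][0]["content"])
```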

What improved

  • Structured reasoning with explicit thinking before answering
  • Better multi-step problem solving and agentic coding
  • More consistent response formatting
  • Improved mathematical and algorithmic reasoning
  • Better frontend and repository-level coding workflows

Hardware Requirements

  • Q2_K: 16 GB VRAM (RTX 5060 Ti, RTX 4080, RTX 3090)
  • Q3_K_S: 24 GB VRAM (RTX 4090, RTX 3090 24 GB)
  • Q4_K_S: 24 GB+ VRAM (tight fit on an RTX 4090)

License

Apache 2.0 — same as base model.

Author: rico03

Likes: 3

Downloads: 0

Tags: gguf, qwen3.6, reasoning, fine-tuned, lora, coding, agentic, text-generation, en, it, dataset:Crownelius/Opus-4.6-Reasoning-3300x, dataset:TeichAI/Claude-Opus-4.6-Reasoning-887x, base_model:Qwen/Qwen3.6-35B-A3B, base_model:adapter:Qwen/Qwen3.6-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational