Today's AI Summary

AI Developments: Real-Time Medical Imaging, Video Motion Editing, and Latent Multi-Agent Collaboration

Here's a look at some of the most interesting AI advancements from today, covering medical imaging, video editing, multi-agent systems, and more.

Research Highlights

  • MedROV: Real-Time Open-Vocabulary Detection in Medical Imaging: A new paper introduces MedROV, a real-time open-vocabulary object detection model for medical imaging. It addresses the limitations of closed-set models by using a large-scale dataset (Omnis) and a pseudo-labeling strategy. MedROV achieves significant improvements over existing models while running at 70 FPS.
  • MotionV2V: Editing Motion in Video: This research explores precise motion control for video editing. The proposed method involves editing sparse trajectories extracted from the input video, enabling powerful editing capabilities. The model generates "motion counterfactuals" and is fine-tuned on a motion-conditioned video diffusion architecture. User studies show a strong preference for this model over prior work.
  • LatentMAS: Latent Collaboration in Multi-Agent Systems: This paper introduces LatentMAS, a framework that enables pure latent collaboration among LLM agents. Agents collaborate directly within the continuous latent space, leading to higher expressiveness, lossless information preservation, and lower complexity compared to text-based MAS. LatentMAS outperforms strong baselines across various benchmarks.
  • MapReduce LoRA: Advancing Multi-Preference Optimization: This research introduces MapReduce LoRA and Reward-aware Token Embedding (RaTE) to address the alignment tax in multi-preference optimization for generative models. MapReduce LoRA trains preference-specific LoRA experts and iteratively merges them, while RaTE learns reward-specific token embeddings. Experiments show significant improvements across text-to-image, text-to-video, and language tasks.
  • Fighting AI with AI: Assuring AI-Enabled Safety-Critical Systems: This paper proposes using AI to address the challenges of integrating AI components into safety-critical systems. REACT employs LLMs to bridge the gap between informal requirements and formal specifications, while SemaLens utilizes VLMs to reason about and test DNN-based perception systems.
  • ROOT: Robust Orthogonalized Optimizer for Neural Network Training: This paper introduces ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. It features a dimension-robust orthogonalization scheme and an optimization-robust framework via proximal optimization. Experiments demonstrate improved robustness, faster convergence, and superior final performance.
  • Copyright Detection in Large Language Models: This paper introduces an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. The approach enhances existing methodologies by facilitating ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead.
  • DiFR: Inference Verification Despite Nondeterminism: This paper introduces Token-DiFR and Activation-DiFR, methods for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization. Activation-DiFR detects 4-bit quantization with high accuracy while reducing communication overhead.
  • Evaluating Deep Learning Models in Whole-body Dynamic 3D Posture Prediction: This study explores the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. The results indicated that the transformer-based model exhibited more accurate long-term performance than the BLSTM-based model.
  • Can Vibe Coding Beat Graduate CS Students?: This paper introduces a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem. The results demonstrate a clear superiority of human-coded agents, highlighting a gap in LLMs' ability to produce code that works competitively in the real world.

Model Releases

  • Comfy-Org/z_image_turbo: This model focuses on diffusion and is designed for use with ComfyUI. It has received 33 likes, indicating a moderate level of community interest.
  • OpenMOSE/Qwen3-VL-REAP-24B-A3B-GGUF: A Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-30B.
  • T5B/Z-Image-Turbo-FP8: This model is a quantization of Tongyi-MAI/Z-Image-Turbo to FP8 E5M2 and FP8 E4M3FN.
  • Disty0/Z-Image-Turbo-SDNQ-uint4-svd-r32: A 4-bit (uint4) SDNQ quantization of Z-Image-Turbo.

AI Papers for 2026-02-17

Semantic Chunking and the Entropy of Natural Language

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
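
The 80 percent figure is simple arithmetic on the two rates quoted above (1 - 1/5). As an illustration only (this is not the paper's model), the sketch below estimates a crude unigram entropy rate for a text sample and the corresponding redundancy relative to a 5-bit-per-character baseline:

```python
import math
from collections import Counter

def redundancy(text: str, raw_bits_per_char: float = 5.0) -> tuple[float, float]:
    """Estimate a crude unigram entropy rate (bits/character) and the redundancy
    relative to a uniform baseline of `raw_bits_per_char`.

    A unigram estimate ignores all inter-character structure, so it overestimates
    the true entropy rate (~1 bit/char for English); it only illustrates how the
    ~80% redundancy figure is computed from an entropy rate.
    """
    counts = Counter(text)
    n = len(text)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h, 1.0 - h / raw_bits_per_char

# With the paper's quoted entropy rate of ~1 bit/char:
# redundancy = 1 - 1/5 = 0.8, i.e. roughly 80% of printed English is redundant.
h, r = redundancy("the quick brown fox jumps over the lazy dog " * 50)
print(f"unigram entropy ~ {h:.2f} bits/char, redundancy ~ {r:.0%}")
```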

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

Optimal Take-off under Fuzzy Clearances

This paper presents a hybrid obstacle avoidance architecture that integrates optimal control under clearance with a Fuzzy Rule-Based System (FRBS) to enable adaptive constraint handling for unmanned aircraft. Motivated by the limitations of classical optimal control under uncertainty and the need for interpretable decision making in safety-critical aviation systems, we design a three-stage Takagi-Sugeno-Kang fuzzy layer that modulates constraint radii, urgency levels, and activation decisions based on regulatory separation minima and airworthiness guidelines from the FAA and EASA. These fuzzy-derived clearances are then incorporated as soft constraints into an optimal control problem solved using the FALCON toolbox and IPOPT. The framework aims to reduce unnecessary recomputations by selectively activating obstacle avoidance updates while maintaining compliance with aviation procedures. A proof-of-concept implementation using a simplified aircraft model demonstrates that the approach can generate optimal trajectories with computation times of 2.3 seconds per iteration in a single-threaded MATLAB environment, suggesting feasibility for near-real-time applications. However, our experiments revealed a critical software incompatibility in the latest versions of FALCON and IPOPT, in which the Lagrangian penalty term remained identically zero, preventing proper constraint enforcement. This behavior was consistent across scenarios and indicates a solver toolbox regression rather than a modeling flaw. Future work includes validating this effect by reverting to earlier software versions, optimizing the fuzzy membership functions using evolutionary methods, and extending the system to higher-fidelity aircraft models and stochastic obstacle environments.
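
For intuition about what such a fuzzy layer does, here is a minimal sketch of a single-input Takagi-Sugeno-Kang rule base that scales a clearance radius with obstacle distance. The membership functions and consequent multipliers are invented for illustration; they are not the paper's tuned values or the FAA/EASA-derived parameters:

```python
def tsk_clearance(distance_m: float, base_radius_m: float = 150.0) -> float:
    """Toy single-input TSK rule base: blend rule consequents (radius multipliers)
    by membership degree in 'near' / 'medium' / 'far'. Illustrative values only."""
    def tri(x, a, b, c):
        # triangular membership function with support (a, c) and peak at b
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    # (membership degree, consequent multiplier) per rule -- invented for this sketch
    rules = [
        (tri(distance_m, -1, 0, 500),      1.5),  # near   -> inflate the soft-constraint radius
        (tri(distance_m, 250, 750, 1250),  1.0),  # medium -> nominal clearance
        (tri(distance_m, 1000, 2000, 1e9), 0.6),  # far    -> relax the clearance
    ]
    num = sum(w * m for w, m in rules)
    den = sum(w for w, _ in rules) or 1.0
    return base_radius_m * num / den              # weighted-average TSK defuzzification

print(tsk_clearance(300.0))   # closer obstacle  -> larger clearance radius (210.0)
print(tsk_clearance(1800.0))  # distant obstacle -> smaller clearance radius (90.0)
```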

Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9 times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical-path latency.
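
A minimal sketch of such a tiered policy, assuming cosine similarity over embeddings and an asynchronous judge call; `embed`, `generate`, and `judge_accepts` are caller-supplied stand-ins, and the thresholds are illustrative rather than Krites' actual configuration:

```python
import asyncio
import numpy as np

STATIC_THRESHOLD = 0.90   # illustrative: serve the curated answer directly above this similarity
JUDGE_BAND = 0.80         # illustrative: "just below threshold" band that triggers the async judge

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

async def serve(prompt, embed, static_cache, dynamic_cache, judge_accepts, generate):
    """Critical path is identical to a plain static-threshold policy; the LLM judge
    only runs off the serving path, promoting near-miss static answers into the
    dynamic cache so future requests can reuse them."""
    q = embed(prompt)
    sim, entry = max(
        ((cosine(q, emb), ent) for emb, ent in static_cache),
        key=lambda t: t[0], default=(0.0, None),
    )

    if entry is not None and sim >= STATIC_THRESHOLD:
        return entry["response"]                      # direct static hit
    if prompt in dynamic_cache:
        return dynamic_cache[prompt]                  # dynamic hit (incl. earlier promotions)

    response = await generate(prompt)                 # normal inference path

    if entry is not None and JUDGE_BAND <= sim < STATIC_THRESHOLD:
        async def promote():
            # Off the critical path: verify that the curated answer also fits this
            # prompt; if approved, future repeats reuse it from the dynamic cache.
            if await judge_accepts(prompt, entry["response"]):
                dynamic_cache[prompt] = entry["response"]
        asyncio.create_task(promote())

    # Exact-match dynamic cache for brevity; a real cache would also be embedding-keyed.
    dynamic_cache[prompt] = response
    return response
```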

In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics from raw system logs and alerts. To address these limitations, we propose to leverage large language models' (LLMs) pre-trained security knowledge and in-context learning to create an end-to-end agentic solution for incident response planning. Specifically, our agent integrates four functionalities (perception, reasoning, planning, and action) into one lightweight LLM (a 14B model). Through fine-tuning and chain-of-thought reasoning, our LLM agent is capable of processing system logs and inferring the underlying network state (perception), updating its conjecture of attack models (reasoning), simulating consequences under different response strategies (planning), and generating an effective response (action). By comparing LLM-simulated outcomes with actual observations, the LLM agent repeatedly refines its attack conjecture and corresponding response, thereby demonstrating in-context adaptation. Our agentic approach is free of modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than frontier LLMs.

Constrained Assumption-Based Argumentation Frameworks

Assumption-based Argumentation (ABA) is a well-established form of structured argumentation. ABA frameworks with an underlying atomic language are widely studied, but their applicability is limited by a representational restriction to ground (variable-free) arguments and attacks built from propositional atoms. In this paper, we lift this restriction and propose a novel notion of constrained ABA (CABA), whose components, as well as arguments built from them, may include constrained variables, ranging over possibly infinite domains. We define non-ground semantics for CABA, in terms of various notions of non-ground attacks. We show that the new semantics conservatively generalise standard ABA semantics.

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
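
A small sketch of the BPE signal as the abstract describes it: query the judge with the two responses in both orders, average the order-corrected preference probabilities, and score uncertainty with binary entropy. The exact aggregation rule below is a plausible reading of the abstract, not necessarily the paper's formula:

```python
import math

def binary_entropy(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bpe_uncertainty(p_a_first: float, p_a_second: float) -> float:
    """p_a_first:  judge's probability that response A wins when A is shown first.
    p_a_second: judge's probability that A wins when A is shown second.
    Averaging the two enforces invariance to response order; the entropy of the
    aggregated probability is the uncertainty score used to decide abstention."""
    p = 0.5 * (p_a_first + p_a_second)
    return binary_entropy(p)

# A position-biased judge whose verdict flips with order is maximally uncertain:
print(bpe_uncertainty(0.9, 0.1))   # -> 1.0  (likely abstain)
# A judge that is consistent under both orders is confidently accepted:
print(bpe_uncertainty(0.9, 0.85))  # -> ~0.54
```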

Which Algorithms Can Graph Neural Networks Learn?

In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the $0$-$1$ knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
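
To make the Bellman-Ford connection concrete, the following sketch (illustrative, not the paper's construction) writes single-source shortest paths as rounds of message passing with min aggregation, the relax-and-aggregate pattern an MPNN layer has to imitate:

```python
import math

def bellman_ford_message_passing(n, edges, source):
    """edges: list of directed (u, v, w) triples; returns distances from `source`.
    Each round, every node aggregates messages min(dist[u] + w) from its
    in-neighbors and updates its state -- the same update an MPNN layer would
    have to learn to approximate."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):                           # n-1 rounds suffice without negative cycles
        messages = [math.inf] * n
        for u, v, w in edges:                        # message from u along edge (u, v)
            messages[v] = min(messages[v], dist[u] + w)
        dist = [min(d, m) for d, m in zip(dist, messages)]   # node-state update
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
print(bellman_ford_message_passing(4, edges, source=0))  # [0.0, 3.0, 1.0, 4.0]
```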

Consistency of Large Reasoning Models Under Multi-Turn Attacks

Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.

How cyborg propaganda reshapes collective action

The distinction between genuine grassroots activism and automated influence operations is collapsing. While policy debates focus on bot farms, a distinct threat to democracy is emerging via partisan coordination apps and artificial intelligence, a combination we term 'cyborg propaganda.' This architecture combines large numbers of verified humans with adaptive algorithmic automation, enabling a closed-loop system. AI tools monitor online sentiment to optimize directives and generate personalized content for users to post online. Cyborg propaganda thereby exploits a critical legal shield: by relying on verified citizens to ratify and disseminate messages, these campaigns operate in a regulatory gray zone, evading liability frameworks designed for automated botnets. We explore the collective action paradox of this technology: does it democratize power by 'unionizing' influence (pooling the reach of dispersed citizens to overcome the algorithmic invisibility of isolated voices), or does it reduce citizens to 'cognitive proxies' of a central directive? We argue that cyborg propaganda fundamentally alters the digital public square, shifting political discourse from a democratic contest of individual ideas to a battle of algorithmic campaigns. We outline a research agenda to distinguish organic from coordinated information diffusion and propose governance frameworks to address the regulatory challenges of AI-assisted collective expression.

AI Models

Qwen/Qwen3.5-397B-A17B


  • library_name: transformers
  • license: apache-2.0
  • license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
  • pipeline_tag: image-text-to-text

Qwen3.5-397B-A17B



[!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
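
A minimal text-only generation sketch, assuming the checkpoint loads through the standard Transformers auto classes; the class choice and chat-template behavior below are assumptions inferred from the image-text-to-text pipeline tag, not confirmed usage instructions for this release:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3.5-397B-A17B"
# Assumption: the checkpoint exposes the image-text-to-text auto class matching the
# pipeline tag above; check the official usage instructions before relying on this.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Give me a short introduction to large language models."}
    ]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```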

[!Tip] For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by Alibaba Cloud Model Studio.

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use. For more information, please refer to the User Guide.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

  • Next-Generation Training Infrastructure: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

Benchmark Results

For more details, please refer to our blog post Qwen3.5.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 397B in total and 17B activated
    • Hidden Dimension: 4096
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 60
      • Hidden Layout: 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE)) (expanded in the sketch after this list)
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 64 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 32 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts
      • Number of Experts: 512
      • Number of Activated Experts: 10 Routed + 1 Shared
      • Expert Intermediate Dimension: 1024
    • LM Output: 248320 (Padded)
    • MTP: trained with multiple steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
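
The hidden-layout expression above can be expanded into an explicit per-layer list; the following is purely illustrative bookkeeping (not Qwen code) that checks the pattern against the stated depth of 60 layers:

```python
# Expand 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
# into a per-layer list of token mixers (each mixer is followed by an MoE block).
def expand_layout(blocks: int = 15) -> list[str]:
    layers = []
    for _ in range(blocks):
        layers += ["gated_deltanet"] * 3   # three linear-attention layers per block
        layers += ["gated_attention"]      # one full (gated) attention layer per block
    return layers

layout = expand_layout()
assert len(layout) == 60                      # matches "Number of Layers: 60"
assert layout.count("gated_attention") == 15  # one softmax-attention layer per block
print(layout[:4])  # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention']
```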

Benchmark Results

Language

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| **Knowledge** | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| **Instruction Following** | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| **Long Context** | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| **STEM** | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
| **Reasoning** | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| **General Agent** | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| **Search Agent** | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| **Multilingualism** | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| **Coding Agent** | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

Notes:

  • HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
  • TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
  • MCP-Mark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
  • Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative tool-response length reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
  • BrowseComp: we tested two strategies: simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
  • WideSearch: we use a 256k context window without any context management.
  • MMLU-ProX: we report the averaged accuracy on 29 languages.
  • WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
  • MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
  • Empty cells (--) indicate scores not yet available or not applicable.

Vision Language

<div style="font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;max-width:900px;margin:0 auto;padding:16px 0"> <table style="width:100%;border-collapse:collapse;font-size:13px"> <thead><tr> <th style="padding:10px 12px;text-align:left;font-weight:600;border-bottom:2px solid #7c3aed;color:#7c3aed"></th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">GPT5.2</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Claude 4.5 Opus</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Gemini-3 Pro</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3-VL-235B-A22B</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">K2.5-1T-A32B</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3.5-397B-A17B</th> </tr></thead> <tbody> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">STEM and Puzzle</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMMU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMMU-Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MathVision</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
rgba(128, 128, 128, 0.15)">84.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Mathvista(mini)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">We-Math</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">DynaMath</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ZEROBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">10</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">12</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ZEROBench_sub</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">33.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">28.4</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">39.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">28.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">33.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">BabyVision</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">34.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">14.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">49.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">22.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">36.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">52.3/43.3</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">General VQA</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RealWorldQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMStar</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HallusionBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 
0.15)">69.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMBench<sub><small>EN-DEV-v1.1</small></sub></td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SimpleVQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">55.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Text Recognition and Document Understanding</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">OmniDocBench1.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">CharXiv(RQ)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 
0.15);">MMLongBench-Doc</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">60.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">CC-OCR</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AI2D_TEST</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">OCRBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Spatial Intelligence</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ERQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">46.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
rgba(128, 128, 128, 0.15)">70.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">52.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">CountBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">97.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">97.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RefCOCO(avg)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ODInW13</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">46.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">47.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">EmbSpatialBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">RefSpatialBench</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LingoQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">V*</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">95.8/91.1</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Hypersim</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">11.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">12.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SUNRGBD</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">34.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 
0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Nuscene</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">13.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">16.0</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Video Understanding</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMME<sub><small>(w sub.)</sub></small></td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMME<sub><small>(w/o sub.)</sub></small></td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VideoMMMU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MLVU 
(M-Avg)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MVBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LVBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMVU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.4</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Visual Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ScreenSpot Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 
128, 0.15)">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">OSWorld-Verified</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AndroidWorld</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.8</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Medical VQA</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SLAKE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">PMC-VQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 
128, 128, 0.15)">64.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MedXpertQA-MM</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">47.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.0</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:11px;opacity:0.7"> * MathVision:our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.<br> * BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.<br> * V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.<br> * Empty cells (--) indicate scores not yet available or not applicable.<br> </p> </div>

Quickstart

[!Important] Qwen3.5 models operate in thinking mode by default, generating thinking content delimited by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples here.

For streamlined integration, we recommend using Qwen3.5 via APIs. Below is a guide to using Qwen3.5 via an OpenAI-compatible API.

Serving Qwen3.5

Qwen3.5 can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers for Qwen3.5 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

SGLang

SGLang is a fast serving framework for large language models and vision language models. Qwen3.5 requires SGLang from the main branch of the open-source repository, which can be installed in a fresh environment using the following command:

uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To support tool use, you can use the following command.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Qwen3.5 requires vLLM from the main branch of the open-source repository, which can be installed in a fresh environment using the following command:

uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

See its documentation for more details.

For detailed Qwen3.5 usage guide, see the vLLM Qwen3.5 recipe.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
    
  • Tool Call: To support tool use, you can use the following command.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
    

Hugging Face Transformers

Hugging Face Transformers includes a lightweight server that can be used for quick testing and moderate-load deployment. The latest transformers is required for Qwen3.5:

pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

See its documentation for more details. Please also make sure torchvision and pillow are installed.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve --force-model Qwen/Qwen3.5-397B-A17B --port 8000 --continuous-batching

Using Qwen3.5 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure it is installed and that the API key and API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
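
As a quick sanity check (a minimal sketch, assuming one of the servers above is running at the configured base URL), you can list the served models before sending chat requests:

from openai import OpenAI

# The client reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment variables set above.
client = OpenAI()
print([model.id for model in client.models.list().data])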

[!Tip] We recommend using the following set of sampling parameters for generation

  • Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that support for sampling parameters varies across inference frameworks.
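
As a hedged sketch (non-standard parameters such as top_k, min_p, and repetition_penalty are accepted via extra_body by vLLM and SGLang, but not necessarily by every framework), the full recommended thinking-mode set can be passed as follows:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        # Parameters outside the OpenAI schema; support varies by serving framework.
        "top_k": 20,
        "min_p": 0.0,
        "repetition_penalty": 1.0,
    },
)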

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

Instruct (or Non-Thinking) Mode

[!Important] Qwen3.5 does not officially support the soft switch of Qwen3, i.e., /think and /nothink.

Qwen3.5 thinks by default before responding. You can obtain a direct response without thinking content by configuring the API parameters. For example:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing the model name, please use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}.

Agentic Usage

Qwen3.5 excels at tool calling.

Qwen-Agent

We recommend using Qwen-Agent to quickly build Agent applications with Qwen3.5.

To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.5-397B-A17B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OpenAI-compatible API, pass the parameter that enables or disables thinking mode in this way
        'extra_body': {
            'enable_thinking': True
        },
    },
}

# Using an OpenAI-compatible API endpoint. It is recommended to disable the reasoning and tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.5-397B-A17B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using the vLLM/SGLang OpenAI-compatible API, pass the parameter that enables or disables thinking mode in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
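
The tools list is not limited to MCP servers. Continuing from the configuration above, a hedged sketch (tool names follow the Qwen-Agent documentation; adjust to the tools available in your installation) that mixes an integrated tool with the MCP configuration:

tools = [
    'code_interpreter',  # integrated tool shipped with Qwen-Agent
    {'mcpServers': {
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
        }
    }}
]
bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user', 'content': 'Plot y = x^2 for x from 0 to 10 and save the figure to my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)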

Qwen Code

Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.5 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vllm and sglang. In general, there are two approaches to enabling YaRN for supported frameworks:

  • Modifying the model configuration file: In the config.json file, change the rope_parameters fields in text_config to:

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    }
    
  • Passing command line arguments:

    For vllm, you can use

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000  
    

    For sglang, you can use

    SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
    

[!NOTE] All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise modifying the rope_parameters configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
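
As a minimal sketch of the rule of thumb above, the factor can be chosen as roughly the typical total context length of your application divided by the native context length:

# Sketch: pick the YaRN factor from your typical context length.
native_context = 262144
typical_context = 524288                  # example from the note above
print(typical_context / native_context)   # 2.0 -> set "factor": 2.0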

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 for thinking mode and using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode.
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output and need not include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure this practice is followed; see the sketch after this list.

  5. Long Video Understanding: To optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is conservatively configured. It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example,

    {"longest_edge": 469762048, "shortest_edge": 4096}
    

    Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
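
The sketch below (referenced from item 4) shows one way to keep only the final output when appending assistant turns to the history, for frameworks that do not apply the provided Jinja2 chat template; it assumes the <think>...</think> delimiters described in the Quickstart:

import re

def strip_thinking(text: str) -> str:
    """Keep only the final answer by removing <think>...</think> blocks."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

raw_output = "<think>\nLet me check...\n</think>\n\nThe answer is 42."
history_entry = {"role": "assistant", "content": strip_thinking(raw_output)}
print(history_entry)  # {'role': 'assistant', 'content': 'The answer is 42.'}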

Citation

If you find our work helpful, feel free to cite us.

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

Author: Qwen

Likes: 439

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, conversational, license:apache-2.0, eval-results, endpoints_compatible, region:us

unsloth/Qwen3.5-397B-A17B-GGUF


tags:

  • unsloth
  • qwen3_5_moe

base_model: Qwen/Qwen3.5-397B-A17B
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
pipeline_tag: image-text-to-text

<div> <p style="margin-bottom: 0; margin-top: 0;"> <h1 style="margin-top: 0rem;">To run Qwen3.5 locally - <a href="https://docs.unsloth.ai/models/qwen3.5">Read our Guide!</a></h1> </p> <p style="margin-top: 0;margin-bottom: 0;"> <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em> </p> <div style="margin-top: 0;display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/models/qwen3.5"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> </div>

You can follow the instructions in our guide here.


Qwen3.5-397B-A17B

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/logo_qwen3.5.png">

Qwen Chat

[!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.

[!Tip] For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by Alibaba Cloud Model Studio.

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use. For more information, please refer to the User Guide.
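
As a hedged sketch of calling the hosted service through the OpenAI-compatible interface (the base URL matches the DashScope compatible-mode endpoint used in the Qwen-Agent example above; the model identifier, assumed here to be "qwen3.5-plus", should be verified in the Model Studio documentation):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)
chat_response = client.chat.completions.create(
    model="qwen3.5-plus",  # assumed model id for Qwen3.5-Plus; verify in Model Studio
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat_response.choices[0].message.content)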

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

  • Next-Generation Training Infrastructure: Near-100% multimodal training efficiency relative to text-only training, together with asynchronous RL frameworks that support massive-scale agent scaffolds and environment orchestration.

Benchmark Results

For more details, please refer to our blog post Qwen3.5.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 397B in total and 17B activated
    • Hidden Dimension: 4096
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 60
      • Hidden Layout: 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 64 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 32 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts
      • Number of Experts: 512
      • Number of Activated Experts: 10 Routed + 1 Shared
      • Expert Intermediate Dimension: 1024
    • LM Output: 248320 (Padded)
    • MTP: trained with multiple prediction steps
  • Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens (an illustrative loading sketch follows below).
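For context, here is a minimal loading sketch using the generic Hugging Face image-text-to-text auto classes. It is an assumption-laden illustration rather than an official quickstart: it presumes a transformers version that registers the qwen3_5_moe architecture, and the chat-message layout shown is the generic multimodal format.

```python
# Illustrative sketch only (not an official quickstart): loading the checkpoint with the
# generic Hugging Face image-text-to-text interface. Assumes a transformers version that
# already registers the qwen3_5_moe architecture; class names and memory requirements
# should be checked against your environment.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3.5-397B-A17B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A single multimodal turn: one image plus a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/logo_qwen3.5.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```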

Benchmark Results

Language

| | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| **Knowledge** | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| **Instruction Following** | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| **Long Context** | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| **STEM** | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
| **Reasoning** | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| **General Agent** | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| **Search Agent³** | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| **Multilingualism** | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| **Coding Agent** | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

Notes:

  • HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
  • TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
  • MCPMark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
  • Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
  • BrowseComp: we tested two strategies; simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
  • WideSearch: we use a 256k context window without any context management.
  • MMLU-ProX: we report the averaged accuracy on 29 languages.
  • WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
  • MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
  • Empty cells (--) indicate scores not yet available or not applicable.

Vision Language

| | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| **STEM and Puzzle** | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| MathVista (mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench EN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv (RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| **Spatial Intelligence** | | | | | | |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO (avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
| **Video Understanding** | | | | | | |
| VideoMME (w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU | | | | | | |
(M-Avg)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MVBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LVBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMVU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">71.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.4</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Visual Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">ScreenSpot Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 
128, 0.15)">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">OSWorld-Verified</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AndroidWorld</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.8</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Medical VQA</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SLAKE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">PMC-VQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 
128, 128, 0.15)">64.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MedXpertQA-MM</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">47.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.0</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:11px;opacity:0.7"> * MathVision:our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.<br> * BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.<br> * V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.<br> * Empty cells (--) indicate scores not yet available or not applicable.<br> </p> </div>

Quickstart

[!Important] Qwen3.5 models operate in thinking mode by default, generating thinking content signified by <think>\n...</think>\n\n before producing the final responses. To disable thinking content and obtain direct responses, refer to the examples below.

For streamlined integration, we recommend using Qwen3.5 via APIs. Below is a guide to using Qwen3.5 via an OpenAI-compatible API.

Serving Qwen3.5

Qwen3.5 can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers for Qwen3.5 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.
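For example, a sketch of reducing the context window to 131,072 tokens using the same flags that appear in the serving commands below (remaining arguments elided):

# SGLang: the context window is controlled by --context-length
python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --context-length 131072 ...

# vLLM: the equivalent flag is --max-model-len
vllm serve Qwen/Qwen3.5-397B-A17B --max-model-len 131072 ...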

SGLang

SGLang is a fast serving framework for large language models and vision language models. SGLang from the main branch of the open-source repository is required for Qwen3.5, which can be installed using the following command in a fresh environment:

uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens using tensor parallelism across 8 GPUs.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To support tool use, you can use the following command.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM from the main branch of the open-source repository is required for Qwen3.5, which can be installed using the following command in a fresh environment:

uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

See its documentation for more details.

For detailed Qwen3.5 usage guide, see the vLLM Qwen3.5 recipe.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens using tensor parallelism across 8 GPUs.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
    
  • Tool Call: To support tool use, you can use the following command.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
    

Hugging Face Transformers

Hugging Face Transformers contains a lightweight server which can be used for quick testing and moderate load deployment. The latest transformers is required for Qwen3.5:

pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

See its documentation for more details.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve --force-model Qwen/Qwen3.5-397B-A17B --port 8000 --continuous-batching

Using Qwen3.5 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure the SDK is installed and that the API key and API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

[!Tip] We recommend using the following set of sampling parameters for generation

  • Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

Instruct (or Non-Thinking) Mode

[!Important] Qwen3.5 does not officially support the soft switch of Qwen3, i.e., /think and /nothink.

Qwen3.5 thinks by default before responding. You can obtain a direct response from the model, without thinking content, by configuring the API parameters. For example:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing the model name, please use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}.
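For illustration, a minimal sketch of the same non-thinking request against the Model Studio OpenAI-compatible endpoint (the base URL and API-key variable match the Qwen-Agent example below; the hosted model name is an assumption, check the Model Studio catalog):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)

chat_response = client.chat.completions.create(
    model="qwen3.5-plus",  # assumed hosted model name; adjust to your Model Studio deployment
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "enable_thinking": False,  # Model Studio expects the flag at the top level, not in chat_template_kwargs
    },
)
print("Chat response:", chat_response)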

Agentic Usage

Qwen3.5 excels in tool calling capabilities.

Qwen-Agent

We recommend using Qwen-Agent to quickly build Agent applications with Qwen3.5.

To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.5-397B-A17B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OpenAI-compatible API, pass the thinking-mode switch in this way
        'extra_body': {
            'enable_thinking': True
        },
    },
}

# Alternatively, use an OpenAI-compatible API endpoint served by vLLM/SGLang, relying on Qwen-Agent
# rather than the parsing functionality of the deployment frameworks to automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.5-397B-A17B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using the vLLM/SGLang OpenAI-compatible API, pass the thinking-mode switch in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.5 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vllm and sglang. In general, there are two approaches to enabling YaRN for supported frameworks:

  • Modifying the model configuration file: In the config.json file, change the rope_parameters fields in text_config to:

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    }
    
  • Passing command line arguments:

    For vllm, you can use

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000  
    

    For sglang, you can use

    SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
    

[!NOTE] All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise modifying the rope_parameters configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
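For instance, a sketch of the rope_parameters override for an application whose typical context length is 524,288 tokens; only the factor changes relative to the example above (and the serving context length should be set accordingly):

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 2.0,
        "original_max_position_embeddings": 262144
    }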

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 for thinking mode and using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode.
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output and omit the thinking content. This is already implemented in the provided Jinja2 chat template. However, for frameworks that do not use the Jinja2 chat template directly, it is up to the developers to ensure this best practice is followed (a minimal sketch is shown after this list).

  5. Long Video Understanding: To optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is conservatively configured. It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example,

    {"longest_edge": 469762048, "shortest_edge": 4096}
    

    Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
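For item 4 above, a minimal sketch of stripping thinking content from prior assistant turns before resending the history, for frameworks that bypass the Jinja2 chat template (the regex and message format are assumptions, not the official template logic):

import re

def strip_thinking(messages):
    """Remove <think>...</think> blocks from assistant turns before resending the history."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            content = re.sub(r"<think>.*?</think>\s*", "", msg["content"], flags=re.DOTALL)
            cleaned.append({**msg, "content": content})
        else:
            cleaned.append(msg)
    return cleaned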

Citation

If you find our work helpful, feel free to give us a cite.

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

Author: unsloth

Likes: 78

Downloads: 0

Tags: transformers, gguf, unsloth, qwen3_5_moe, image-text-to-text, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

stepfun-ai/NextStep-1.1-Pretrain-256px


license: apache-2.0 pipeline_tag: text-to-image library_name: transformers

NextStep-1.1

Homepage  | GitHub  | Paper 

We introduce NextStep-1.1, a new model that represents a significant leap forward in the NextStep series. This version effectively resolves the visualization failures seen in NextStep-1 and substantially elevates image quality through extended training and a Flow-based Reinforcement Learning (RL) post-training paradigm.

<div align='center'> <img src="assets/comparision.png" class="interpolation-image" alt="arch." width="100%" /> </div>

What's New in 1.1?

NextStep-1.1 is not just a fine-tune; it is a re-engineered version focused on stability and high-fidelity output. Key improvements include:

  • RL Enhanced Visual Fidelity: Significant improvement in image texture and a substantial reduction in visual artifacts via RL, ensuring much cleaner and more professional outputs.

  • Technical Stability: Solves numerical instability inherent in autoregressive flow-based models.

Environment Setup

To avoid potential errors when loading and running your models, we recommend using the following settings:

conda create -n nextstep python=3.11 -y
conda activate nextstep

pip install uv # optional

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/stepfun-ai/NextStep-1.1-Pretrain-256px && cd NextStep-1.1-Pretrain-256px
uv pip install -r requirements.txt

hf download stepfun-ai/NextStep-1.1-Pretrain-256px "vae/checkpoint.pt" --local-dir ./

Usage

import torch
from transformers import AutoTokenizer, AutoModel
from models.gen_pipeline import NextStepPipeline

HF_HUB = "stepfun-ai/NextStep-1.1-Pretrain-256px"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True)
pipeline = NextStepPipeline(tokenizer=tokenizer, model=model).to(device="cuda", dtype=torch.bfloat16)

# set prompts
positive_prompt = ""
negative_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
example_prompt = "A REALISTIC PHOTOGRAPH OF A WALL WITH \"TOWARD AUTOREGRESSIVE IMAGE GENERATION WITH CONTINUOUS TOKENS AT SCALE\" PROMINENTLY DISPLAYED"

# generate image from text
IMG_SIZE = 256
image = pipeline.generate_image(
    example_prompt,
    hw=(IMG_SIZE, IMG_SIZE),
    num_images_per_caption=1,
    positive_prompt=positive_prompt,
    negative_prompt=negative_prompt,
    cfg=7.5,
    cfg_img=1.0,
    cfg_schedule="constant",
    use_norm=False,
    num_sampling_steps=28,
    timesteps_shift=1.0,
    seed=3407,
)[0]
image.save("./assets/output.jpg")

Citation

If you find NextStep useful for your research and applications, please consider starring this repository and citing:

@article{nextstepteam2025nextstep1,
  title={NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale},
  author={NextStep Team and Chunrui Han and Guopeng Li and Jingwei Wu and Quan Sun and Yan Cai and Yuang Peng and Zheng Ge and Deyu Zhou and Haomiao Tang and Hongyu Zhou and Kenkun Liu and Ailin Huang and Bin Wang and Changxin Miao and Deshan Sun and En Yu and Fukun Yin and Gang Yu and Hao Nie and Haoran Lv and Hanpeng Hu and Jia Wang and Jian Zhou and Jianjian Sun and Kaijun Tan and Kang An and Kangheng Lin and Liang Zhao and Mei Chen and Peng Xing and Rui Wang and Shiyu Liu and Shutao Xia and Tianhao You and Wei Ji and Xianfang Zeng and Xin Han and Xuelin Zhang and Yana Wei and Yanming Xu and Yimin Jiang and Yingming Wang and Yu Zhou and Yucheng Han and Ziyang Meng and Binxing Jiao and Daxin Jiang and Xiangyu Zhang and Yibo Zhu},
  journal={arXiv preprint arXiv:2508.10711},
  year={2025}
}

Author: stepfun-ai

Likes: 7

Downloads: 0

Tags: transformers, safetensors, nextstep, text-generation, text-to-image, custom_code, arxiv:2508.10711, license:apache-2.0, endpoints_compatible, region:us

dejanseo/google-links


base_model:

  • microsoft/deberta-v3-base pipeline_tag: token-classification tags:
  • links

Link Anchor Detection Model

A fine-tuned DeBERTa v3 model that predicts which words in text should be hyperlinks. Trained on 10,273 pages scraped from The Keyword (Google's official blog), where editorial linking decisions serve as ground truth labels.

How It Works

Given raw text, the model performs token-level binary classification — each token is labeled LINK or O (not a link). This identifies anchor text candidates: words that a human editor would likely hyperlink.

Pipeline

sitemap.xml (10,274 URLs from blog.google)
        │
        ▼
   scrape.py ──► scraped.db (SQLite, 10,273 pages with markdown + inline links)
        │
        ▼
    _prep.py ──► train_windows.jsonl / val_windows.jsonl
        │         • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
        │         • Tokenize with DeBERTa, align labels to tokens
        │         • Sliding windows (512 tokens, stride 128)
        │         • 90/10 doc-level split
        ▼
   train.py ──► model_link_token_cls/
        │         • Fine-tune microsoft/mdeberta-v3-base
        │         • Weighted cross-entropy (~25x for minority class)
        │         • 3 epochs, lr 2e-5, batch 16
        ▼
    app.py ──► Streamlit UI
                  • Sliding-window inference (handles any text length)
                  • Word-level highlighting with confidence scores

Data

Source: blog.google sitemap (The Keyword — Google's product and technology blog).

| Metric | Value |
|---|---|
| Pages scraped | 10,273 |
| Total tokens | 8.2M |
| Link tokens | 286,799 (3.48%) |
| Training windows | 21,264 |
| Validation windows | 2,402 |

The class imbalance (96.5% non-link vs 3.5% link) is handled with weighted cross-entropy loss during training.
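A minimal sketch of wiring such a class-weighted loss into token classification (the ~25x minority-class weight comes from the pipeline above; the exact training code is an assumption):

import torch
import torch.nn as nn

# Index 0 = O (non-link), index 1 = LINK; up-weight the minority class ~25x.
loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 25.0]), ignore_index=-100)

def token_cls_loss(logits, labels):
    # logits: (batch, seq_len, 2); labels: (batch, seq_len) with -100 on padding/special tokens.
    return loss_fct(logits.view(-1, 2), labels.view(-1))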

Model

  • Base: microsoft/mdeberta-v3-base (DebertaV2ForTokenClassification)
  • Labels: O (0), LINK (1)
  • Max position: 512 tokens
  • Parameters: 12 layers, 768 hidden, 12 attention heads

Evaluation Results

| Metric | Value |
|---|---|
| Accuracy | 95.6% |
| Precision | 42.4% |
| Recall | 79.5% |
| F1 | 0.553 |

High recall means the model catches most link-worthy text. Lower precision reflects the inherent ambiguity — many words could be linked, so "false positives" are often reasonable candidates.

Usage

Streamlit App

streamlit run app.py

Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.

Python

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
model.eval()

text = "Google announced new features for Search and Gmail today."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
    probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

for token, offset, p in zip(
    tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
    enc["offset_mapping"][0],
    probs
):
    if offset[0] == offset[1]:
        continue  # skip special tokens
    if p > 0.5:
        print(f"  LINK: {text[offset[0]:offset[1]]} ({p:.2%})")

Scripts

| File | Purpose |
|---|---|
| scrape.py | Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files |
| _prep.py | Cleans markdown, annotates link spans, tokenizes, creates sliding windows |
| train.py | Fine-tunes DeBERTa with weighted loss, W&B tracking |
| app.py | Streamlit inference app with sliding-window support |
| _count.py | Token length analysis utility |
| _detok.py | Token ID decoder (Streamlit) |

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Playwright (for scraping)
  • Streamlit (for inference app)

Author: dejanseo

Likes: 5

Downloads: 0

Tags: safetensors, deberta-v2, links, token-classification, base_model:microsoft/deberta-v3-base, base_model:finetune:microsoft/deberta-v3-base, region:us

mlx-community/Qwen3.5-397B-A17B-4bit


library_name: mlx license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE base_model: Qwen/Qwen3.5-397B-A17B pipeline_tag: text-generation tags:

  • mlx
  • 4bit
  • quantized
  • qwen3_5_moe
  • moe
  • mixture-of-experts
  • text-generation
  • conversational
  • apple-silicon language:
  • multilingual

Qwen3.5-397B-A17B-4bit (MLX)

4-bit MLX quantized version of the text model from Qwen/Qwen3.5-397B-A17B.

Portions of this card were copied or adapted from the original model card, authored by the Qwen team.

Model Overview

Qwen3.5-397B-A17B is Alibaba's latest flagship language model, featuring a hybrid architecture that combines Gated DeltaNet (linear attention) with sparse Mixture-of-Experts for high-throughput inference. Despite having 397B total parameters, only ~17B are activated per token, making it remarkably efficient for its capability level.

This conversion provides a text-only 4-bit quantized version optimized for local inference on Apple Silicon Macs via the MLX framework. The vision encoder from the original multimodal model is not included — for image/video understanding, refer to the original Qwen/Qwen3.5-397B-A17B.

Key Capabilities

  • 201 languages and dialects with deep cultural and regional understanding
  • 262K native context (extensible to 1M+ with YaRN)
  • Thinking mode with chain-of-thought reasoning (<think>...</think>)
  • Tool use and agentic workflows (MCP, function calling)
  • Competitive benchmarks: MMLU-Pro 87.8, SuperGPQA 70.4, C-Eval 93.0

Architecture

| Parameter | Value |
|---|---|
| Total Parameters | 397B |
| Active Parameters | ~17B |
| Hidden Size | 4,096 |
| Layers | 60 |
| Layer Layout | 15 × (3 × Gated DeltaNet + 1 × Full Attention), all with MoE FFN |
| Total Experts | 512 |
| Active Experts per Token | 10 routed + 1 shared |
| Expert Intermediate Size | 1,024 |
| Full Attention Heads | 32 Q / 2 KV (GQA), head dim 256 |
| Linear Attention Heads | 16 QK / 64 V, head dim 128 |
| Context Length | 262,144 tokens |
| Vocab Size | 248,320 |

Quantization Details

| Parameter | Value |
|---|---|
| Method | Affine quantization |
| Bits | 4-bit (weights) |
| Group Size | 64 |
| MoE Router Gates | 8-bit (preserved at higher precision) |
| Model Size on Disk | ~223 GB |

The MoE router gates (mlp.gate and mlp.shared_expert_gate for all 60 layers) are kept at 8-bit precision to preserve routing accuracy, which is critical for Mixture-of-Experts models.
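A hedged sketch of how such a mixed-precision conversion can be expressed, assuming your installed mlx-lm exposes convert() with a quant_predicate hook (the recipe below is illustrative, not necessarily the one used for this checkpoint):

from mlx_lm import convert

def keep_router_gates_8bit(path, module, config):
    # Keep MoE router gates at 8-bit; quantize everything else with the default 4-bit / group size 64.
    if path.endswith("mlp.gate") or path.endswith("mlp.shared_expert_gate"):
        return {"bits": 8, "group_size": 64}
    return True

convert(
    "Qwen/Qwen3.5-397B-A17B",
    mlx_path="Qwen3.5-397B-A17B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=keep_router_gates_8bit,
)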

Requirements

  • Apple Silicon Mac with at least 256 GB unified memory (e.g., Mac Studio M3 Ultra 256GB+)
  • Python 3.10+
  • mlx-lm from the main branch

Installation

pip install git+https://github.com/ml-explore/mlx-lm

Usage

Quick Start — Python API

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=4096,
    verbose=True,
    temp=0.6,
    top_p=0.95,
)

Thinking Mode (Default)

The model defaults to thinking mode, producing chain-of-thought reasoning inside <think>...</think> tags before the final answer:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "How many r's are in the word 'strawberry'?"}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    verbose=True,
    temp=0.6,
    top_p=0.95,
)

Non-Thinking Mode

For faster, more direct responses without chain-of-thought reasoning:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "Write a haiku about machine learning."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    temp=0.7,
    top_p=0.8,
)

Command Line

# Thinking mode (default)
mlx_lm.generate \
    --model mlx-community/Qwen3.5-397B-A17B-4bit \
    --prompt "What are the key differences between TCP and UDP?" \
    --max-tokens 4096 \
    --temp 0.6 \
    --top-p 0.95

# Start a local chat server (OpenAI-compatible)
mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit

Local OpenAI-Compatible Server

Start the server:

mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit --port 8080

Then query it with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-397B-A17B-4bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."},
    ],
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)

Or with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.5-397B-A17B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    "temperature": 0.6
  }'

Recommended Generation Parameters

| Parameter | Thinking Mode | Non-Thinking Mode |
|---|---|---|
| temperature | 0.6 | 0.7 |
| top_p | 0.95 | 0.8 |
| top_k | 20 | 20 |
| presence_penalty | 0.0 | 1.5 |
| repetition_penalty | 1.0 | 1.0 |
| max_tokens (general) | 32,768 | 32,768 |
| max_tokens (math/code) | 81,920 | — |

Tips

  • Thinking mode is best for complex reasoning, math, and coding tasks. The model will produce internal reasoning before answering.
  • Non-thinking mode is better for straightforward Q&A, creative writing, and conversational use where latency matters.
  • For math problems, append: "Please reason step by step, and put your final answer within \boxed{}."
  • For multi-turn conversations, the default chat template automatically strips thinking content from prior turns.
  • If running into memory pressure, consider closing other applications to free unified memory.

Original Model

This is a quantized version of Qwen/Qwen3.5-397B-A17B. Refer to the original model card for full benchmark results, training details, and the technical report.

Citation

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

Author: mlx-community

Likes: 3

Downloads: 0

Tags: mlx, safetensors, qwen3_5_moe, 4bit, quantized, moe, mixture-of-experts, text-generation, conversational, apple-silicon, multilingual, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, 4-bit, region:us

KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF


language:

  • zh
  • en pipeline_tag: text-generation base_model: jdopensource/JoyAI-LLM-Flash tags:
  • llama-cpp
  • gguf-my-repo

KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF

This model was converted to GGUF format from jdopensource/JoyAI-LLM-Flash using llama.cpp via the ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF --hf-file joyai-llm-flash-q6_k.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF --hf-file joyai-llm-flash-q6_k.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the Llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (e.g., LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF --hf-file joyai-llm-flash-q6_k.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo KnutJaegersberg/JoyAI-LLM-Flash-Q6_K-GGUF --hf-file joyai-llm-flash-q6_k.gguf -c 2048
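Once llama-server is running (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint; a minimal sketch:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'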

Author: KnutJaegersberg

Likes: 3

Downloads: 0

Tags: gguf, llama-cpp, gguf-my-repo, text-generation, zh, en, base_model:jdopensource/JoyAI-LLM-Flash, base_model:quantized:jdopensource/JoyAI-LLM-Flash, endpoints_compatible, region:us

unsloth/Qwen3.5-397B-A17B


tags:

  • unsloth base_model:
  • Qwen/Qwen3.5-397B-A17B library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE pipeline_tag: image-text-to-text

Qwen3.5-397B-A17B

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/logo_qwen3.5.png">

Qwen Chat

[!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.

[!Tip] For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by Alibaba Cloud Model Studio.

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use. For more information, please refer to the User Guide.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

  • Next-Generation Training Infrastructure: Near-100% multimodal training efficiency compared to text-only training, plus asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

Benchmark Results

For more details, please refer to our blog post Qwen3.5.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 397B in total and 17B activated
    • Hidden Dimension: 4096
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 60
      • Hidden Layout: 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 64 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 32 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts
      • Number of Experts: 512
      • Number of Activated Experts: 10 Routed + 1 Shared
      • Expert Intermediate Dimension: 1024
    • LM Output: 248320 (Padded)
    • MTP: trained with multiple steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.

Benchmark Results

Language

<div style="font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;color:#1a1a2e;max-width:900px;margin:0 auto;padding:16px 0"> <table style="width:100%;border-collapse:collapse;font-size:13px"> <thead><tr> <th style="padding:10px 12px;text-align:left;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95"></th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">GPT5.2</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Claude 4.5 Opus</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Gemini-3 Pro</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Qwen3-Max-Thinking</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">K2.5-1T-A32B</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Qwen3.5-397B-A17B</th> </tr></thead> <tbody> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Knowledge</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMLU-Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMLU-Redux</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">95.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">95.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">95.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SuperGPQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">C-Eval</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.2</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">93.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.0</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Instruction Following</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">IFEval</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">IFBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">58.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MultiChallenge</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">64.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.6</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Long Context</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">AA-LCR</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">LongBench v2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">64.4</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">60.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">61.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.2</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">STEM</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">GPQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HLE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">35.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">30.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">37.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">30.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">30.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">28.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HLE-Verified¹</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">48</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">37.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">37.6</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Reasoning</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">LiveCodeBench v6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HMMT Feb 25</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">99.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.9</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">97.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">98.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">95.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HMMT Nov 25</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">100</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">IMOAnswerBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">AIME26</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">96.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.3</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">General Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">BFCL-V4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">TAU2-Bench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
#f0f0f0">84.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">VITA-Bench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">56.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">51.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">40.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">41.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">49.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">DeepPlanning</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">44.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">33.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">23.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">28.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">14.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">34.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Tool Decathlon</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">36.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">18.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">27.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MCP-Mark</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">42.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">53.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">33.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">29.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">46.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Search Agent³</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HLE w/ tool</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">45.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">45.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">49.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">50.2</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">48.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">BrowseComp</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">59.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">53.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--/74.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.0/78.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">BrowseComp-zh</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">60.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">WideSearch</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Seal-0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">45.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">47.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">45.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">46.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">46.9</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Multilingualism</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMMLU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid 
#f0f0f0;color:#444">MMLU-ProX</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">82.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">NOVA-63</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">56.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">56.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">56.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">59.1</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">INCLUDE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">82.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Global PIQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">PolyMATH</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">64.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">WMT24++</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.7</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MAXIFE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.2</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Coding Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SWE-bench Verified</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SWE-bench Multilingual</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SecCodeBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">61.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Terminal Bench 2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">59.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">22.5</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">50.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">52.5</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:11px;color:#888"> * HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.<br> * TAU2-Bench:we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.<br> * MCPMark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.<br> * Seach Agent: most Search Agents built on our model adopt a simple context-folding strategy(256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.<br> * BrowseComp: we tested two strategies, simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.<br> * WideSearch: we use a 256k context window without any context management.<br> * MMLU-ProX: we report the averaged accuracy on 29 languages.<br> * WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.<br> * MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).<br> * Empty cells (--) indicate scores not yet available or not applicable.<br> </p> </div>

Vision Language

<div style="font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;color:#1a1a2e;max-width:900px;margin:0 auto;padding:16px 0"> <table style="width:100%;border-collapse:collapse;font-size:13px"> <thead><tr> <th style="padding:10px 12px;text-align:left;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95"></th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">GPT5.2</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Claude 4.5 Opus</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Gemini-3 Pro</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Qwen3-VL-235B-A22B</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">K2.5-1T-A32B</th> <th style="padding:10px 12px;text-align:center;font-weight:600;border-bottom:2px solid #7c3aed;color:#4c1d95">Qwen3.5-397B-A17B</th> </tr></thead> <tbody> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">STEM and Puzzle</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMMU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMMU-Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MathVision</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Mathvista(mini)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.0</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">87.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">We-Math</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">DynaMath</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">82.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">ZEROBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">10</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">12</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">ZEROBench_sub</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">33.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">28.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">39.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">28.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">33.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">41.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">BabyVision</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">34.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">14.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">49.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">22.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">36.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
#f0f0f0">52.3/43.3</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">General VQA</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">RealWorldQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMStar</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">HallusionBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">64.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">71.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMBench<sub>EN-DEV-v1.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SimpleVQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">55.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">61.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">71.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid 
#e5e7eb;background:#faf5ff">Text Recognition and Document Understanding</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">OmniDocBench1.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">CharXiv(RQ)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">82.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMLongBench-Doc</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">61.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">60.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">56.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">58.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">61.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">CC-OCR</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">82.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">AI2D_TEST</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">89.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">OCRBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.7</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Spatial Intelligence</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">ERQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">59.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">46.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">52.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">CountBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">97.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">94.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">97.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">RefCOCO(avg)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">91.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">92.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">ODInW13</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">46.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">43.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">47.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">EmbSpatialBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
#f0f0f0">61.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">RefSpatialBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">LingoQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">78.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">68.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">V*</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">95.8/91.1</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Hypersim</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">11.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">12.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SUNRGBD</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">34.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.3</td> </tr> <tr> <td style="padding:7px 
12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">Nuscene</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">13.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">16.0</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Video Understanding</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">VideoMME (w sub.)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">88.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">VideoMME (w/o sub.)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">VideoMMMU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">84.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MLVU (M-Avg)</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">83.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">86.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MVBench</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid #f0f0f0">78.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">67.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">LVBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">57.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MMVU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">71.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.4</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Visual Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">ScreenSpot Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">45.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">OSWorld-Verified</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">38.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">AndroidWorld</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> 
<td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">66.8</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid #e5e7eb;background:#faf5ff">Medical VQA</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">VQA-RAD</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">69.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">74.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">SLAKE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">54.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">81.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">79.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">OM-VQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">72.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">75.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">80.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">85.1</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">PMC-VQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">58.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">59.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">62.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">41.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">64.2</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid #f0f0f0;color:#444">MedXpertQA-MM</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">73.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">63.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">76.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
#f0f0f0">47.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">65.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid #f0f0f0">70.0</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:11px;color:#888"> * MathVision:our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.<br> * BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.<br> * V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.<br> * Empty cells (--) indicate scores not yet available or not applicable.<br> </p> </div>

Quickstart

[!Important] Qwen3.5 models operate in thinking mode by default, generating thinking content signified by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples here.

For streamlined integration, we recommend using Qwen3.5 via APIs. Below is a guide to use Qwen3.5 via OpenAI-compatible API.

Serving Qwen3.5

Qwen3.5 can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers for Qwen3.5 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

SGLang

SGLang is a fast serving framework for large language models and vision language models. SGLang from the main branch of the open-source repository is required for Qwen3.5, which can be installed using the following command in a fresh environment:

uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To support tool use, you can use the following command.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM from the main branch of the open-source repository is required for Qwen3.5, which can be installed using the following command in a fresh environment:

uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

See its documentation for more details.

For detailed Qwen3.5 usage guide, see the vLLM Qwen3.5 recipe.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
    
  • Tool Call: To support tool use, you can use the following command.

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

    vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
    

Hugging Face Transformers

Hugging Face Transformers contains a lightweight server which can be used for quick testing and moderate load deployment. The latest transformers is required for Qwen3.5:

pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

See its documentation for more details.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve --force-model Qwen/Qwen3.5-397B-A17B --port 8000 --continuous-batching

Using Qwen3.5 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure it is installed and that the API key and the API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

[!Tip] We recommend using the following set of sampling parameters for generation

  • Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

Instruct (or Non-Thinking) Mode

[!Important] Qwen3.5 does not officially support the soft switch of Qwen3, i.e., /think and /nothink.

Qwen3.5 will think by default before responding. You can obtain a direct response from the model without thinking by configuring the API parameters. For example:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing model, please use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}.
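For illustration, here is a minimal sketch of a non-thinking request against a Model Studio-style endpoint; the base URL and model name below follow the Qwen-Agent configuration further down and are assumptions you should adapt to your own account:

import os
from openai import OpenAI

# Assumed DashScope OpenAI-compatible endpoint (see the Qwen-Agent section below).
client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)

chat_response = client.chat.completions.create(
    model="Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Introduce yourself briefly."}],
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "enable_thinking": False,  # Model Studio takes this flag directly, not via chat_template_kwargs
    },
)
print("Chat response:", chat_response)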

Agentic Usage

Qwen3.5 excels in tool calling capabilities.

Qwen-Agent

We recommend using Qwen-Agent to quickly build Agent applications with Qwen3.5.

To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.5-397B-A17B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OpenAI-compatible API, pass the thinking-mode switch this way
        'extra_body': {
            'enable_thinking': True
        },
    },
}

# Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and
# tool-call parsing functionality of the deployment frameworks and let Qwen-Agent automate
# the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.5-397B-A17B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using the vLLM/SGLang OpenAI-compatible API, pass the thinking-mode switch this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.5 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vllm and sglang. In general, there are two approaches to enabling YaRN for supported frameworks:

  • Modifying the model configuration file: In the config.json file, change the rope_parameters fields in text_config to:

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    }
    
  • Passing command line arguments:

    For vllm, you can use

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000  
    

    For sglang, you can use

    SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
    

[!NOTE] All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise modifying the rope_parameters configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
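For a quick sanity check on choosing the factor, it is simply the ratio of your target context window to the native 262,144-token window; a small sketch with a hypothetical helper (not part of any framework):

# Hypothetical helper for picking a YaRN scaling factor.
ORIGINAL_MAX_POSITION_EMBEDDINGS = 262_144  # Qwen3.5 native context length

def yarn_factor(target_context_len: int) -> float:
    """Smallest scaling factor that covers the target context length."""
    return max(1.0, target_context_len / ORIGINAL_MAX_POSITION_EMBEDDINGS)

print(yarn_factor(524_288))    # 2.0, matching the example in the note above
print(yarn_factor(1_010_000))  # ~3.85, which the serving commands above round up to 4.0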

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 for thinking mode and using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode.
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output part, not the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this practice is followed; see the sketch after this list.

  5. Long Video Understanding: It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance.
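As a minimal sketch of point 4 for frameworks that bypass the Jinja2 chat template, assistant turns appended to the history can be stripped of their thinking spans (assuming the <think>...</think> format described in the Quickstart note):

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    """Keep only the final answer; drop the <think>...</think> span."""
    return THINK_BLOCK.sub("", text).strip()

assistant_reply = "<think>\nLet me check...\n</think>\n\nThe answer is 42."
history_entry = {"role": "assistant", "content": strip_thinking(assistant_reply)}
print(history_entry)  # {'role': 'assistant', 'content': 'The answer is 42.'}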

Citation

If you find our work helpful, feel free to give us a cite.

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

Author: unsloth

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, unsloth, conversational, base_model:Qwen/Qwen3.5-397B-A17B, base_model:finetune:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, endpoints_compatible, region:us

lukealonso/GLM-5-NVFP4


base_model:

  • zai-org/GLM-5 license: mit

(Weights still uploading...)

Model Description

GLM-5-NVFP4 is an NVFP4-quantized version of zai-org/GLM-5, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).

Quantized directly from the full BF16 checkpoint (zai-org/GLM-5), to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.
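As a rough, purely illustrative sketch of what that format means numerically (this is not the Model Optimizer implementation), each group of 16 weights shares one scale and each weight is snapped to a 4-bit E2M1 value:

import torch

# Toy blockwise 4-bit quantization: one scale per 16 elements, magnitudes snapped to E2M1 values.
# Real NVFP4 packs 4-bit codes and FP8 (E4M3) scales; this only mimics the rounding behaviour.
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(weights: torch.Tensor, block: int = 16) -> torch.Tensor:
    w = weights.reshape(-1, block)
    scale = (w.abs().amax(dim=1, keepdim=True) / E2M1_VALUES.max()).clamp_min(1e-12)
    scaled = (w / scale).clamp(-6.0, 6.0)
    # Snap each magnitude to the nearest representable E2M1 value, keeping the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    dequant = torch.sign(scaled) * E2M1_VALUES[idx] * scale
    return dequant.reshape(weights.shape)

w = torch.randn(4096)
print((w - fake_nvfp4(w)).abs().mean())  # average quantization error of the toy scheme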

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. Attention layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than typical to ensure broad expert coverage through natural routing alone.

Calibration dataset

Three calibration passes were run:

  1. Coding pass — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
  2. Broad pass — Large-scale diverse samples drawn from WildChat and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
  3. Deep pass — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.

Merged via element-wise max across all calibration runs.
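A minimal sketch of that merge step, assuming each calibration pass produced a dict of per-tensor amax statistics (the tensor names here are hypothetical):

import torch

# Hypothetical per-tensor calibration maxima from the three passes described above.
coding_amax = {"experts.0.up_proj": torch.tensor(3.1)}
broad_amax  = {"experts.0.up_proj": torch.tensor(2.7)}
deep_amax   = {"experts.0.up_proj": torch.tensor(3.4)}

def merge_amax(*runs):
    """Element-wise max across calibration runs, so the final scales cover every pass."""
    merged = {}
    for run in runs:
        for name, value in run.items():
            merged[name] = value if name not in merged else torch.maximum(merged[name], value)
    return merged

print(merge_amax(coding_amax, broad_amax, deep_amax))  # {'experts.0.up_proj': tensor(3.4000)}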

How to Run

Even in NVFP4 this is a huge model: it requires 440 GB of VRAM just for the weights, and easily another 100 GB for activations and the KV cache.

I've tested it with 8x RTX 6000 Pro and get ~50 tokens/sec.

4x RTX 6000 Pro may be possible with CPU offloading, but I haven't tried it.

If you experience NCCL hangs with P2P, make sure you have iommu=pt (and amd_iommu=pt on AMD platforms) in your kernel command line.

SGLang

export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1
export NCCL_MIN_NCHANNELS=8
export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1

python3 -m sglang.launch_server \
  --model lukealonso/GLM-5-NVFP4 \
  --served-model-name glm-5 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --trust-remote-code \
  --tp 8 \
  --mem-fraction-static 0.95 \
  --max-running-requests 8 \
  --kv-cache-dtype fp8_e4m3 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --disable-custom-all-reduce \
  --enable-flashinfer-allreduce-fusion \
  --host 0.0.0.0 \
  --port 8000

vLLM

Please contribute vLLM instructions if you successfully manage to run this model.

Author: lukealonso

Likes: 3

Downloads: 0

Tags: safetensors, glm_moe_dsa, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, license:mit, 8-bit, modelopt, region:us

ubergarm/Qwen3.5-397B-A17B-GGUF


quantized_by: ubergarm pipeline_tag: text-generation base_model: Qwen/Qwen3.5-397B-A17B base_model_relation: quantized license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE tags:

  • imatrix
  • conversational
  • qwen3_5_moe
  • ik_llama.cpp

WIP

ik_llama.cpp does not yet support this model, though there is an open issue and work on qwen3next is coming in now.

For now, to help out with testing, I used mainline llama.cpp to make an imatrix (GGUF format) in case others would like to use it to make their own imatrix custom quants.

Check the logs/ directory for details on imatrix calculation.

I'll upload more if/when ik_llama.cpp support is merged.

It seems to run inference very slowly on CPU only and probably requires at least one GPU to handle the attention/KV-cache/delta-net layers, as it is much faster even in hybrid CPU+GPU mode.

Q3_K 179.97 GiB (3.90 BPW)

This is a custom MoE-optimized mix similar to AesSedai/ddh0's mixes and likely better than vanilla Q3_K_ mixes. Check the recipe for details.

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

./build/bin/llama-quantize \
    --tensor-type ffn_down_exps=q4_K \
    --tensor-type ffn_gate_exps=q3_K \
    --tensor-type ffn_up_exps=q3_K \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    --imatrix /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/imatrix-Qwen3.5-397B-A17B-BF16-mainline.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K.gguf \
    Q8_0 \
    128
</details>

Quick Start

Example command for mainline llama.cpp for now, including the mmproj from another repo. Just remove --mmproj if you don't want image capabilities.

# AMD Thread Ripper Pro (Zen 4) 7965WX 24x Core
# 8x32GiB DDR5@4800 (221.41 GB/s via mlc)
# Dual RTX A6000 (48GB VRAM each)
# Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!

model=/mnt/raid/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K-00001-of-00005.gguf

./build/bin/llama-server \
    --model "$model"\
    --mmproj /mnt/raid/models/ubergarm/Qwen3.5-397B-A17B-GGUF/mmproj-BF16.gguf \
    --alias ubergarm/Qwen3.5-397B-A17B \
    -fa on \
    --ctx-size 135168 \
    -ctk q8_0 -ctv q8_0 \
    -ub 2048 -b 2048 \
    -fit off \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0,blk\.(47|47|48|49|50|51|52|53|54|55|56|57|58|59|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080 \
    --parallel 1 \
    --no-mmap \
    --jinja

prompt eval time =   20177.37 ms /  4036 tokens (    5.00 ms per token,   200.03 tokens per second)
       eval time =  118034.13 ms /  2525 tokens (   46.75 ms per token,    21.39 tokens per second)
      total time =  138211.50 ms /  6561 tokens

prompt eval time =   53071.66 ms / 11154 tokens (    4.76 ms per token,   210.17 tokens per second)
       eval time =    5012.90 ms /   104 tokens (   48.20 ms per token,    20.75 tokens per second)
      total time =   58084.56 ms / 11258 tokens

Vibe Coding

I noticed that when trying opencode it spits out this error; I still got the error when using a chat template from another repo via --chat-template-file myTemplate.jinja.

Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 120, column 73 in source:\n..._name, args_value in tool_call.arguments|items %}↵                        {{- '<...\n                                           ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}

Fortunately, the autoparser branch seems to be working: https://github.com/ggml-org/llama.cpp/pull/18675

You can get the freshest version like so:

git remote add pwilkin git@github.com:pwilkin/llama.cpp.git
git fetch pwilkin
git checkout pwilkin/autoparser

# compile as normal


Author: ubergarm

Likes: 2

Downloads: 0

Tags: gguf, imatrix, conversational, qwen3_5_moe, ik_llama.cpp, text-generation, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, endpoints_compatible, region:us

heretic-org/Nanbeige4.1-3B-heretic


license: apache-2.0 language:

  • en
  • zh library_name: transformers pipeline_tag: text-generation tags:
  • llm
  • nanbeige
  • heretic
  • uncensored
  • decensored
  • abliterated base_model:
  • Nanbeige/Nanbeige4.1-3B

This is a decensored version of Nanbeige/Nanbeige4.1-3B, made using Heretic v1.2.0

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| direction_index | 17.48 |
| attn.o_proj.max_weight | 3.99 |
| attn.o_proj.max_weight_position | 20.00 |
| attn.o_proj.min_weight | 3.24 |
| attn.o_proj.min_weight_distance | 13.70 |
| mlp.down_proj.max_weight | 3.18 |
| mlp.down_proj.max_weight_position | 28.59 |
| mlp.down_proj.min_weight | 0.33 |
| mlp.down_proj.min_weight_distance | 15.40 |

Performance

| Metric | This model | Original model (Nanbeige/Nanbeige4.1-3B) |
| :----- | :--------: | :--------------------------------------: |
| KL divergence | 0.0001 | 0 (by definition) |
| Refusals | 4/100 | 96/100 |

<div align="left"> <img src="assets/Nanbeige_Nanbeige4.1-3B.gif"> </div>
<div align="center"> <img src="https://huggingface.co/Nanbeige/Nanbeige4.1-3B/resolve/main/figures/nbg.png" width="220" alt="Nanbeige Logo"> </div>

Introduction

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

<div align="center"> <img src="https://huggingface.co/Nanbeige/Nanbeige4.1-3B/resolve/main/figures/model_performance_comparison.png"> </div>

Specifically, Nanbeige4.1-3B exhibits the following key strengths:

  • Strong Reasoning: Nanbeige4.1-3B is capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, and reliably produces correct final answers on challenging tasks such as LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
  • Robust Preference Alignment: Nanbeige4.1-3B achieves solid alignment performance, outperforming not only same-scale models such as Qwen3-4B-2507 and Nanbeige4-3B-2511, but also substantially larger models including Qwen3-30B-A3B and Qwen3-32B on Arena-Hard-v2 and Multi-Challenge.
  • Agentic Capability: Nanbeige4.1-3B is the first general small model to natively support deep-search tasks and reliably sustain complex problem solving involving more than 500 rounds of tool invocations. It fills a long-standing gap in the small-model ecosystem where models are typically optimized for either general reasoning or agentic scenarios, but rarely excel at both.

Technical Report: Link

Performances

We evaluate Nanbeige4.1-3B across a broad and diverse set of benchmarks covering general reasoning and deep-search capabilities.

General Reasoning Tasks

On general reasoning tasks including code, math, science, alignment, and tool-use benchmarks, Nanbeige4.1-3B not only significantly outperforms same-scale models such as Qwen3-4B, but also demonstrates overall superior performance compared to larger models including Qwen3-30B-A3B-2507 and Qwen3-32B.

| Benchmark | Qwen3-4B-2507 | Qwen3-8B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B-2507 | Nanbeige4-3B-2511 | Nanbeige4.1-3B |
| --------- | ------------- | -------- | --------- | --------- | ------------------ | ----------------- | -------------- |
| Code | | | | | | | |
| Live-Code-Bench-V6 | 57.4 | 49.4 | 55.9 | 55.7 | <u>66.0</u> | 46.0 | 76.9 |
| Live-Code-Bench-Pro-Easy | 40.2 | 41.2 | 33.0 | 42.3 | <u>60.8</u> | 40.2 | 81.4 |
| Live-Code-Bench-Pro-Medium | 5.3 | 3.5 | 1.8 | 3.5 | 3.5 | <u>5.3</u> | 28.1 |
| Math | | | | | | | |
| AIME 2026 I | 81.46 | 70.42 | 76.46 | 75.83 | <u>87.30</u> | 84.1 | 87.40 |
| HMMT Nov | 68.33 | 48.33 | 56.67 | 57.08 | <u>71.25</u> | 66.67 | 77.92 |
| IMO-Answer-Bench | 48.00 | 36.56 | 41.81 | 43.94 | 54.34 | 38.25 | 53.38 |
| Science | | | | | | | |
| GPQA | 65.8 | 62.0 | 63.38 | 68.4 | 73.4 | <u>82.2</u> | 83.8 |
| HLE (Text-only) | 6.72 | 5.28 | 7.00 | 9.31 | <u>11.77</u> | 10.98 | 12.60 |
| Alignment | | | | | | | |
| Arena-Hard-v2 | 34.9 | 26.3 | 36.9 | 56.0 | <u>60.2</u> | 60.0 | 73.2 |
| Multi-Challenge | 41.14 | 36.30 | 36.97 | 38.72 | <u>49.40</u> | 41.20 | 52.21 |
| Tool Use | | | | | | | |
| BFCL-V4 | 44.87 | 42.20 | 45.14 | 47.90 | 48.6 | <u>53.8</u> | 56.50 |
| Tau2-Bench | 45.9 | 42.06 | 44.96 | 45.26 | <u>47.70</u> | 41.77 | 48.57 |

Deep Search Tasks

As a general small model, Nanbeige4.1-3B achieves deep-search performance comparable to specialized agents under 10B parameters. In contrast to existing small general models, which typically exhibit little to no deep-search capability, it represents a substantial qualitative improvement.

Deep Search and Agent Benchmarks

| Model | xBench-DeepSearch-2505 | xBench-DeepSearch-2510 | Browse-Comp | Browse-Comp-ZH | GAIA (Text-only) | HLE | SEAL-0 |
| ----- | ---------------------- | ---------------------- | ----------- | -------------- | ---------------- | --- | ------ |
| Search-Specialized Small Agents | | | | | | | |
| MiroThinker-v1.0-8B | 61 | – | 31.1 | 40.2 | 66.4 | 21.5 | 40.4 |
| AgentCPM-Explore-4B | 70 | – | 25.0 | 29.0 | 63.9 | 19.1 | 40.0 |
| Large Foundation Models (with Tools) | | | | | | | |
| GLM-4.6-357B | 70 | – | 45.1 | 49.5 | 71.9 | 30.4 | – |
| Minimax-M2-230B | 72 | – | 44.0 | 48.5 | 75.7 | 31.8 | – |
| DeepSeek-V3.2-671B | 71 | – | 67.6 | 65.0 | 63.5 | 40.8 | 38.5 |
| Small Foundation Models (with Tools) | | | | | | | |
| Qwen3-4B-2507 | 34 | 5 | 1.57 | 7.92 | 28.33 | 11.13 | <u>15.74</u> |
| Qwen3-8B | 31 | 2 | 0.79 | 5.15 | 19.53 | 10.24 | 6.34 |
| Qwen3-14B | 34 | 9 | 2.36 | 7.11 | 30.23 | 10.17 | 12.64 |
| Qwen3-32B | <u>39</u> | 8 | <u>3.15</u> | <u>7.34</u> | 30.17 | 9.26 | 8.15 |
| Qwen3-30B-A3B-2507 | 25 | 10 | 1.57 | 4.12 | <u>31.63</u> | <u>14.81</u> | 9.24 |
| Ours (with Tools) | | | | | | | |
| Nanbeige4-3B-2511 | 33 | <u>11</u> | 0.79 | 3.09 | 19.42 | 13.89 | 12.61 |
| Nanbeige4.1-3B | 75 | 39 | 19.12 | 31.83 | 69.90 | 22.29 | 41.44 |

<span id="Inference">Quickstart</span>

For inference hyperparameters, we recommend the following settings:

  • Temperature: 0.6
  • Top-p: 0.95
  • Repeat penalty: 1.0
  • Max New Tokens: 131072

For the chat scenario:

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  use_fast=False,
  trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  torch_dtype='auto',
  device_map='auto',
  trust_remote_code=True
)
messages = [
  {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]
prompt = tokenizer.apply_chat_template(
  messages,
  add_generation_prompt=True,
  tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
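# To follow the recommended settings listed above, you could additionally pass
# do_sample=True, temperature=0.6, top_p=0.95, max_new_tokens=131072 to generate().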
output_ids = model.generate(input_ids.to('cuda'), eos_token_id=166101)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)

For the tool use scenario:

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  use_fast=False,
  trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  torch_dtype='auto',
  device_map='auto',
  trust_remote_code=True
)
messages = [
    {'role': 'user',  'content': 'Help me check the weather in Beijing now'}
]
tools = [{'type': 'function',
  'function': {'name': 'SearchWeather',
   'description': 'Find out the current weather in a place on a certain day.',
   'parameters': {'type': 'dict',
    'properties': {'location': {'type': 'string',
      'description': 'A city in China.'}},
    'required': ['location']}}}]
prompt = tokenizer.apply_chat_template(
  messages,
  tools,
  add_generation_prompt=True,
  tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
output_ids = model.generate(input_ids.to('cuda'), eos_token_id=166101)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)

For the deep-search scenario:

  • Inference Framework: miroflow-framework
  • Switch tokenizer configuration to tokenizer_config_search.json
  • Tools Configuration:

| Server | Description | Tools Provided |
| ------ | ----------- | -------------- |
| tool-python | Execution environment and file management (E2B sandbox) | create_sandbox, run_command, run_python_code, upload_file_from_local_to_sandbox, download_file_from_sandbox_to_local, download_file_from_internet_to_sandbox |
| search_and_scrape_webpage | Google search via Serper API | google_search |
| jina_scrape_llm_summary | Web scraping with LLM-based information extraction with Jina | scrape_and_extract_info |

  • Summary model: Qwen3-14B-thinking
  • Temperature: 1.0
  • Note, access to HuggingFace has been explicitly disabled in these tools.

<span id="Limitations">Limitations</span>

While we place great emphasis on the safety of the model during the training process, striving to ensure that its outputs align with ethical and legal requirements, it may not completely avoid generating unexpected outputs due to the model's size and probabilistic nature. These outputs may include harmful content such as bias or discrimination. Please don't propagate such content. We do not assume any responsibility for the consequences resulting from the dissemination of inappropriate information. <br>

<span id="Limitations">Contact</span>

If you have any questions, please raise an issue or contact us at nanbeige@kanzhun.com. <br>

Author: heretic-org

Likes: 2

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, llm, nanbeige, heretic, uncensored, decensored, abliterated, conversational, en, zh, base_model:Nanbeige/Nanbeige4.1-3B, base_model:finetune:Nanbeige/Nanbeige4.1-3B, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us