Today's AI Summary

AI Developments: Reasoning, Vision Models, and Efficient Training Take Center Stage

Today's AI landscape is buzzing with advancements across several key areas, from enhancing reasoning capabilities in visual generation to improving the efficiency of training large language models and refining vision models through dataset distillation. Here's a quick rundown of the most interesting developments:

Research Highlights

  • Thinking-while-Generating: A new framework called "Thinking-while-Generating" (TwiG) introduces interleaved textual reasoning throughout the visual generation process. This allows for dynamic interaction between text and image, leading to more context-aware and semantically rich visual outputs. The study explores different strategies, including zero-shot prompting, supervised fine-tuning, and reinforcement learning, to understand the dynamics of interleaved reasoning.
  • Efficient Reasoning RL Training: A system called TLT tackles the efficiency bottlenecks in training reasoning models using Reinforcement Learning (RL). TLT integrates adaptive speculative decoding, using an "Adaptive Drafter" model trained on idle GPUs and an "Adaptive Rollout Engine" to accelerate RL training without compromising model accuracy. TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems.
  • Dataset Distillation for Pre-Trained Vision Models: A new method called Linear Gradient Matching addresses the problem of distilling datasets for training linear probes on top of pre-trained vision models. This method optimizes synthetic images so that they induce gradients in the linear classifier similar to those produced by real data (see the sketch after this list). The resulting synthetic data outperforms real-image baselines and generalizes across pre-trained vision models.
  • Cognitive Foundations for Reasoning in LLMs: A new study synthesizes cognitive science research into a taxonomy of 28 cognitive elements and analyzes their behavioral manifestations in reasoning traces of LLMs. The analysis reveals systematic structural differences between human and model reasoning, with humans employing hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining.
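To make the gradient-matching idea concrete, here is a minimal sketch: a frozen pre-trained backbone feeds a linear probe, and synthetic images are optimized so the probe's gradient on them matches the gradient on real batches. The ResNet-18 backbone, the toy stand-in for the real-image loader, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of linear gradient matching for dataset distillation.
# Synthetic images are optimized so the linear probe's gradient on them
# matches the gradient computed on real batches.
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.resnet18(weights="IMAGENET1K_V1").to(device).eval()
backbone.fc = torch.nn.Identity()           # expose 512-d pooled features
for p in backbone.parameters():
    p.requires_grad_(False)

num_classes, feat_dim, ipc = 10, 512, 1     # ipc = synthetic images per class
linear = torch.nn.Linear(feat_dim, num_classes).to(device)

# The synthetic images themselves are the parameters being optimized.
syn_x = torch.randn(num_classes * ipc, 3, 224, 224, device=device, requires_grad=True)
syn_y = torch.arange(num_classes, device=device).repeat_interleave(ipc)
opt = torch.optim.Adam([syn_x], lr=0.1)

# Toy stand-in for a labelled real-image DataLoader (assumed); swap in real data.
real_loader = [(torch.randn(32, 3, 224, 224), torch.randint(0, num_classes, (32,)))
               for _ in range(5)]

def probe_grad(x, y, create_graph):
    """Gradient of the linear probe's cross-entropy loss w.r.t. the probe weights."""
    loss = F.cross_entropy(linear(backbone(x)), y)
    return torch.autograd.grad(loss, linear.parameters(), create_graph=create_graph)

for real_x, real_y in real_loader:
    real_x, real_y = real_x.to(device), real_y.to(device)
    g_real = [g.detach() for g in probe_grad(real_x, real_y, create_graph=False)]
    g_syn = probe_grad(syn_x, syn_y, create_graph=True)
    # Cosine distance between real and synthetic gradients, summed over parameter tensors.
    match = sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
                for a, b in zip(g_real, g_syn))
    opt.zero_grad()
    match.backward()
    opt.step()
```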

Model Spotlight

Several new models have been released, focusing on various applications:

  • Ensemble Learning Cloud Classifier: The momererkoc/cloud_classifier model uses ensemble learning to classify cloud images into 7 categories, achieving an F1-score of 0.86. It combines ResNet50, VGG16, and InceptionV3 models with a meta-learner neural network (see the stacking sketch after this list).
  • HoloPASWIN: The gokhankocmarli/holopaswin-v1 model is designed to eliminate the twin-image problem in in-line holography. It uses a physics-aware Swin-UNet architecture trained with synthetic holograms, achieving a Mean Squared Error (MSE) of 0.000642 and a Structural Similarity (SSIM) of 0.9933.
  • McG-221/atom-v1-preview-12b-mlx-8Bit: This is the vanta-research/atom-v1-preview-12b model converted to MLX format for efficient use on Apple silicon.
  • Quantized LLMs: Several quantized versions of large language models have been released, including bartowski/miromind-ai_MiroThinker-v1.0-8B-GGUF, ArtusDev/TheDrummer_Snowpiercer-15B-v4-EXL3, and bartowski/TheDrummer_Snowpiercer-15B-v4-GGUF. These models offer various quantization levels to balance performance and memory usage.
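As a rough illustration of the stacking pattern behind the cloud classifier entry above, the sketch below has three pre-trained backbones each emit 7-way probabilities and a small meta-learner combine them into the final prediction. The layer sizes, the 64-unit meta-learner, and the inference-only usage are assumptions for illustration, not details from the model card.

```python
# Minimal sketch of a stacking ensemble: three ImageNet backbones produce
# per-class probabilities and a small meta-learner maps their concatenation
# to the final 7-way cloud prediction.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7

def make_base(name):
    """Load a pre-trained backbone and replace its classifier with a 7-way head."""
    if name == "resnet50":
        m = models.resnet50(weights="IMAGENET1K_V2")
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    elif name == "vgg16":
        m = models.vgg16(weights="IMAGENET1K_V1")
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, NUM_CLASSES)
    else:  # inception_v3 (expects 299x299 inputs; its aux head is unused in eval mode)
        m = models.inception_v3(weights="IMAGENET1K_V1")
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    return m

class StackedCloudClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bases = nn.ModuleList(
            [make_base(n) for n in ("resnet50", "vgg16", "inception_v3")])
        # Meta-learner: concatenated base probabilities -> final logits.
        self.meta = nn.Sequential(
            nn.Linear(3 * NUM_CLASSES, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))

    def forward(self, x):
        probs = [base(x).softmax(dim=-1) for base in self.bases]
        return self.meta(torch.cat(probs, dim=-1))

model = StackedCloudClassifier().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 299, 299))  # 299x299 also suits InceptionV3
print(logits.shape)                              # torch.Size([2, 7])
```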

Key Takeaways

  • Reasoning is evolving: Research is pushing towards more dynamic and human-like reasoning processes in AI, particularly in areas like visual generation.
  • Efficiency remains a priority: Efforts to improve the efficiency of training and deploying large models are ongoing, with techniques like adaptive speculative decoding and quantization gaining traction.
  • Vision models are becoming more refined: Dataset distillation and ensemble learning are being used to enhance the performance and generalization capabilities of vision models.
  • Cognitive science offers insights: Bridging cognitive science and LLM research can lead to models that reason through principled cognitive mechanisms, improving their capabilities and interpretability.

AI Papers for 2026-02-19

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects whether to step over, climb onto, vault over, or roll off obstacles of varying geometries and heights, and executes the chosen skill. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
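The abstract formulates motion matching as nearest-neighbor search in a feature space. The sketch below shows only that search step, with made-up feature dimensions, distance weights, and a random toy database; it illustrates the formulation, not the paper's implementation.

```python
# Minimal sketch of motion matching as nearest-neighbor search: every frame of
# every retargeted clip is described by a feature vector (e.g., current pose,
# foot state, desired future root trajectory), and the controller jumps to the
# database frame whose features best match the current query.
import numpy as np

class MotionMatcher:
    def __init__(self, features, clip_ids, frame_ids, weights=None):
        self.features = np.asarray(features, dtype=np.float32)   # (N, D) feature database
        self.clip_ids = np.asarray(clip_ids)                     # which atomic skill each frame came from
        self.frame_ids = np.asarray(frame_ids)                   # frame index within that clip
        self.w = np.ones(self.features.shape[1], dtype=np.float32) if weights is None else weights

    def query(self, current_pose_feat, desired_traj_feat):
        """Return (clip, frame) of the best-matching database entry."""
        q = np.concatenate([current_pose_feat, desired_traj_feat]).astype(np.float32)
        d = np.sum(self.w * (self.features - q) ** 2, axis=1)    # weighted squared distance
        i = int(np.argmin(d))
        return self.clip_ids[i], self.frame_ids[i]

# Toy database: 1000 frames from 5 clips, 24-d pose features + 8-d trajectory features.
rng = np.random.default_rng(0)
db = MotionMatcher(rng.normal(size=(1000, 32)),
                   clip_ids=rng.integers(0, 5, size=1000),
                   frame_ids=rng.integers(0, 200, size=1000))
clip, frame = db.query(rng.normal(size=24), rng.normal(size=8))
print(f"transition to clip {clip}, frame {frame}")
```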

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly and even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
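CrispEdit enforces the capability constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape, made tractable at LLM scale with K-FAC and a matrix-free projector. The toy sketch below shows only the underlying projection idea on a Hessian small enough to eigendecompose directly; the stand-in Hessian, the 70% cutoff, and the parameter dimension are illustrative assumptions.

```python
# Toy illustration of projecting an edit update onto the low-curvature
# subspace of a capability loss. Real editors work with factored or
# matrix-free curvature; here the Hessian is tiny and dense.
import torch

torch.manual_seed(0)
dim = 20

# Stand-in capability-loss curvature: a symmetric PSD Hessian with a few sharp directions.
A = torch.randn(dim, dim)
H_cap = A @ A.T / dim

# Proposed edit update (e.g., a gradient step on some editing objective).
delta = torch.randn(dim)

# Eigendecompose the capability Hessian and keep only the low-curvature directions.
evals, evecs = torch.linalg.eigh(H_cap)     # eigenvalues in ascending order
k = int(0.7 * dim)                          # keep the flattest 70% of directions (arbitrary choice)
V_low = evecs[:, :k]

# Project the update into the low-curvature subspace, so it avoids the sharp
# capability-loss directions by construction.
delta_proj = V_low @ (V_low.T @ delta)

curv = lambda d: float(d @ H_cap @ d)       # curvature-weighted squared norm d^T H d
print(f"d^T H d: raw edit {curv(delta):.4f}, projected edit {curv(delta_proj):.4f}")
```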

Developing AI Agents with Simulated Data: Why, what, and how?

As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key concepts, benefits, and challenges of simulation-based synthetic data generation for AI training purposes, and to a reference framework to describe, design, and analyze digital twin-based AI simulation solutions.

Avey-B

Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

Task-Agnostic Continual Learning for Chest Radiograph Classification

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest X-ray datasets arrive sequentially and task identifiers are unavailable at inference. We propose a continual adapter-based routing learning strategy for Chest X-rays (CARL-XRay) that maintains a fixed high-capacity backbone and incrementally allocates lightweight task-specific adapters and classifier heads. A latent task selector operates on task-adapted features and leverages both current and historical context preserved through compact prototypes and feature-level experience replay. This design supports stable task identification and adaptation across sequential updates while avoiding raw-image storage. Experiments on large-scale public chest radiograph datasets demonstrate robust performance retention and reliable task-aware inference under continual dataset ingestion. CARL-XRay outperforms joint training under task-unknown deployment, achieving higher routing accuracy (75.0% vs. 62.5%), while maintaining competitive diagnostic performance with AUROC of 0.74 in the oracle setting with ground-truth task identity and 0.75 under task-unknown inference, using significantly fewer trainable parameters. Finally, the proposed framework provides a practical alternative to joint training and repeated full retraining in continual clinical deployment.
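A minimal sketch of the frozen-backbone, per-task-adapter, prototype-routed pattern the abstract describes is shown below. The backbone, feature dimension, bottleneck size, and the nearest-prototype routing rule are assumptions for illustration; CARL-XRay's actual selector and replay mechanism are more involved.

```python
# Sketch: a frozen backbone, one lightweight adapter + classifier head per
# dataset/task, and a prototype-based selector that routes inputs when the
# task id is unknown at inference.
import torch
import torch.nn as nn

class AdapterRouter(nn.Module):
    def __init__(self, backbone, feat_dim=512, bottleneck=64):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # backbone stays fixed across tasks
            p.requires_grad_(False)
        self.adapters = nn.ModuleList()
        self.heads = nn.ModuleList()
        self.prototypes = []                    # one mean feature vector per task
        self.feat_dim, self.bottleneck = feat_dim, bottleneck

    def add_task(self, num_labels, task_features):
        """Allocate a new adapter/head and store a feature prototype for routing."""
        self.adapters.append(nn.Sequential(
            nn.Linear(self.feat_dim, self.bottleneck), nn.ReLU(),
            nn.Linear(self.bottleneck, self.feat_dim)))
        self.heads.append(nn.Linear(self.feat_dim, num_labels))
        self.prototypes.append(task_features.mean(dim=0).detach())

    def forward(self, x, task_id=None):
        f = self.backbone(x)
        if task_id is None:                     # task-unknown inference: nearest prototype
            protos = torch.stack(self.prototypes)                  # (num_tasks, feat_dim)
            task_id = int(torch.cdist(f.mean(0, keepdim=True), protos).argmin())
        return self.heads[task_id](f + self.adapters[task_id](f)), task_id

# Toy usage with a random stand-in backbone and two chest X-ray tasks.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
model = AdapterRouter(backbone)
model.add_task(num_labels=5, task_features=torch.randn(100, 512))
model.add_task(num_labels=3, task_features=torch.randn(100, 512) + 2.0)
logits, routed = model(torch.randn(4, 3, 64, 64))
print(logits.shape, "routed to task", routed)
```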

Decision Quality Evaluation Framework at Pinterest

Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
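One standard way such an "intelligent sampling" pipeline can expand coverage while keeping prevalence metrics unbiased is propensity-score sampling with inverse-propensity reweighting. The sketch below shows that generic pattern; the propensity heuristic, policies, and numbers are toy assumptions, not Pinterest's actual pipeline.

```python
# Generic sketch of propensity-score sampling for review coverage: each
# moderation decision gets a sampling propensity (here, a toy heuristic that
# up-weights rare policies and low-confidence decisions), items are sampled
# with those probabilities, and prevalence is estimated with inverse-propensity
# weights so non-uniform sampling stays unbiased.
import random

random.seed(0)
decisions = [
    # (decision_id, policy, model_confidence, is_violation_ground_truth)
    (i, random.choice(["spam", "spam", "spam", "self_harm"]), random.random(), random.random() < 0.1)
    for i in range(10_000)
]

def propensity(policy, confidence):
    """Toy sampling propensity: rarer policy and lower confidence => sample more often."""
    base = 0.02 if policy == "spam" else 0.20
    return min(1.0, base + 0.1 * (1.0 - confidence))

sampled, weights = [], []
for d_id, policy, conf, label in decisions:
    p = propensity(policy, conf)
    if random.random() < p:                 # Bernoulli sample into the review queue
        sampled.append(label)
        weights.append(1.0 / p)             # inverse-propensity weight

# Horvitz-Thompson style prevalence estimate over the full decision stream.
est = sum(w for lbl, w in zip(sampled, weights) if lbl) / len(decisions)
true = sum(1 for *_, lbl in decisions if lbl) / len(decisions)
print(f"sampled {len(sampled)} of {len(decisions)}; est prevalence {est:.3f} vs true {true:.3f}")
```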

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods can neither detect nor defend against. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings

Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey the nuanced relationships among closely related subtypes, limiting AI's semantic comprehension. To address this limitation, this study proposes a novel training approach that employs large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings to preserve finer distinctions in building semantics. We evaluated the proposed method by training GraphSAGE models to classify 42 building object subtypes across five high-rise residential building information models (BIMs). Various embedding dimensions were tested, including original high-dimensional LLM embeddings (1,536, 3,072, or 4,096) and 1,024-dimensional compacted embeddings generated via the Matryoshka representation model. Experimental results demonstrated that LLM encodings outperformed the conventional one-hot baseline, with the llama-3 (compacted) embedding achieving a weighted average F1-score of 0.8766, compared to 0.8475 for one-hot encoding. The results underscore the promise of leveraging LLM-based encodings to enhance AI's ability to interpret complex, domain-specific building semantics. As the capabilities of LLMs and dimensionality reduction techniques continue to evolve, this approach holds considerable potential for broad application in semantic elaboration tasks throughout the AECO industry.
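A minimal sketch of the encoding swap the study describes, replacing one-hot object-type vectors with LLM text embeddings as node features, is shown below. The choice of text-embedding-3-large, the use of the API's dimensions parameter as a stand-in for the 1,024-dimensional Matryoshka compaction, and the toy node-to-subtype mapping are assumptions for illustration; the study's exact pipeline may differ.

```python
# Sketch: embed building-object subtype names with an LLM embedding model and
# use those vectors as GNN node features instead of one-hot encodings.
import torch
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

subtypes = ["structural column", "curtain wall panel", "fire-rated door",
            "supply air duct", "domestic cold water pipe"]

# One request for the batch of subtype names; ask for 1,024-d compacted embeddings.
resp = client.embeddings.create(model="text-embedding-3-large",
                                input=subtypes, dimensions=1024)
subtype_vecs = torch.tensor([d.embedding for d in resp.data])   # (num_subtypes, 1024)

# Each BIM object node gets its subtype's embedding as its feature vector, to be
# combined with geometric/relational features before the GraphSAGE layers.
node_subtype_ids = torch.tensor([0, 2, 2, 4, 1, 3])              # toy node -> subtype map
node_features = subtype_vecs[node_subtype_ids]                   # (num_nodes, 1024)
print(node_features.shape)
```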

This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
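One simple form of statistical calibration is a difference estimator: estimate the effect on a large pool of simulated participants, audit a small random subset with real humans, and shift the simulated estimate by the human-minus-LLM discrepancy observed on that subset. The sketch below illustrates that generic estimator with toy response functions; it is not necessarily the estimator the authors propose.

```python
# Generic difference-estimator sketch: combine many simulated (LLM) responses
# with a small audited human sample to correct the simulated treatment-effect
# estimate for LLM-induced bias.
import random, statistics

random.seed(1)

def llm_response(treated):      # toy simulated outcome with a biased treatment effect (1.5)
    return random.gauss(1.5 if treated else 0.0, 1.0)

def human_response(treated):    # toy "true" outcome with treatment effect 1.0
    return random.gauss(1.0 if treated else 0.0, 1.0)

n_sim, n_audit = 5000, 200
assign = [i % 2 == 0 for i in range(n_sim)]            # treatment assignment

sim = [llm_response(t) for t in assign]                 # full simulated experiment
audit_idx = random.sample(range(n_sim), n_audit)        # small audited subset
human = {i: human_response(assign[i]) for i in audit_idx}   # auxiliary human data

def ate(outcomes, idx):
    treated = [outcomes[i] for i in idx if assign[i]]
    control = [outcomes[i] for i in idx if not assign[i]]
    return statistics.mean(treated) - statistics.mean(control)

tau_sim = ate(sim, range(n_sim))            # effect in the full simulation
tau_sim_audit = ate(sim, audit_idx)         # simulated effect on the audited units
tau_human_audit = ate(human, audit_idx)     # human effect on the same units

tau_calibrated = tau_sim + (tau_human_audit - tau_sim_audit)
print(f"simulated {tau_sim:.2f}, calibrated {tau_calibrated:.2f} (truth ~1.0)")
```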

GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems

In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose the Global State Diffusion Algorithm (GlobeDiff) to infer the global state based on the local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.
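A generic sketch of treating global-state inference as a conditional diffusion process appears below: a denoiser conditioned on the agents' concatenated local observations learns to reverse noise added to logged global states, and the global state is then recovered by ancestral sampling at execution time. The network, noise schedule, toy observation model, and training loop are standard DDPM machinery used purely for illustration; GlobeDiff's exact formulation may differ.

```python
# Conditional-diffusion sketch: infer a global state from local observations by
# reversing a learned diffusion process conditioned on those observations.
import torch
import torch.nn as nn

STATE_DIM, OBS_DIM, T = 16, 3 * 8, 50        # global state, 3 agents x 8-d local obs, diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                     # predicts the noise added to the global state
    nn.Linear(STATE_DIM + OBS_DIM + 1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, STATE_DIM))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def eps_pred(noisy_state, obs, t):
    t_feat = (t.float() / T).unsqueeze(-1)    # simple scalar timestep embedding
    return denoiser(torch.cat([noisy_state, obs, t_feat], dim=-1))

# Training: corrupt true global states and learn to predict the injected noise.
for _ in range(200):
    state = torch.randn(64, STATE_DIM)        # stand-in for logged true global states
    obs = state.repeat(1, 2)[:, :OBS_DIM] + 0.1 * torch.randn(64, OBS_DIM)  # toy local views
    t = torch.randint(0, T, (64,))
    noise = torch.randn_like(state)
    noisy = alphas_bar[t].sqrt().unsqueeze(-1) * state + (1 - alphas_bar[t]).sqrt().unsqueeze(-1) * noise
    loss = ((eps_pred(noisy, obs, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: ancestral sampling of the global state given current local observations.
obs = torch.randn(1, OBS_DIM)
x = torch.randn(1, STATE_DIM)
for step in reversed(range(T)):
    t = torch.full((1,), step)
    eps = eps_pred(x, obs, t)
    a, ab = 1.0 - betas[step], alphas_bar[step]
    x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
    if step > 0:
        x = x + betas[step].sqrt() * torch.randn_like(x)
print("inferred global state:", x.shape)
```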

AI Models

Qwen/Qwen3.5-397B-A17B-FP8


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
pipeline_tag: image-text-to-text

Qwen3.5-397B-A17B



Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
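Below is a minimal Transformers loading sketch, assuming a recent transformers release, the standard image-text-to-text auto class and chat-template interface, and enough GPU memory for a 397B-parameter MoE; the exact class name, message format, and recommended serving setup should be checked against the official model card and docs.

```python
# Minimal (assumed) Hugging Face Transformers usage for the image-text-to-text checkpoint.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3.5-397B-A17B-FP8"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/chart.png"},   # placeholder image URL
    {"type": "text", "text": "Summarize this chart."}]}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```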

Tip: For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by Alibaba Cloud Model Studio.

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use. For more information, please refer to the User Guide.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

  • Next-Generation Training Infrastructure: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.


For more details, please refer to our blog post Qwen3.5.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 397B in total and 17B activated
    • Hidden Dimension: 4096
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 60
      • Hidden Layout: 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE)) (expanded in the sketch after this list)
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 64 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 32 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture of Experts:
      • Number of Experts: 512
      • Number of Activated Experts: 10 Routed + 1 Shared
      • Expert Intermediate Dimension: 1024
    • LM Output: 248320 (Padded)
    • MTP: trained with multiple steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
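The tiny sketch below expands the Hidden Layout pattern from the overview into a flat per-layer list and checks that it accounts for the stated 60 layers; it uses only the numbers already given above.

```python
# Expand "15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))"
# into a flat per-layer list: 15 blocks of 3 linear-attention layers plus 1
# gated-attention layer, each followed by an MoE FFN.
block = ["Gated DeltaNet + MoE"] * 3 + ["Gated Attention + MoE"] * 1
layers = block * 15

assert len(layers) == 60                      # matches "Number of Layers: 60"
print(f"{len(layers)} layers, "
      f"{layers.count('Gated DeltaNet + MoE')} Gated DeltaNet, "
      f"{layers.count('Gated Attention + MoE')} Gated Attention")
# -> 60 layers, 45 Gated DeltaNet, 15 Gated Attention
```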

Benchmark Results

Language

<div style="font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;max-width:900px;margin:0 auto;padding:16px 0"> <table style="width:100%;border-collapse:collapse;font-size:13px"> <thead><tr> <th style="padding:10px 12px;text-align:left;font-weight:600;border-bottom:2px solid #7c3aed;color:#7c3aed"></th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">GPT5.2</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Claude 4.5 Opus</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Gemini-3 Pro</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3-Max-Thinking</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">K2.5-1T-A32B</th> <th style="padding:10px 12px;text-align:center;font-weight:500;border-bottom:2px solid #7c3aed;color:#7c3aed;font-size: 14px;">Qwen3.5-397B-A17B</th> </tr></thead> <tbody> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Knowledge</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMLU-Pro</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMLU-Redux</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">95.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">95.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">95.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SuperGPQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
rgba(128, 128, 128, 0.15)">69.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">C-Eval</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.0</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Instruction Following</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">IFEval</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">IFBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">58.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MultiChallenge</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.6</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 
0.1)">Long Context</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AA-LCR</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LongBench v2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">60.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.2</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">STEM</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">GPQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HLE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">35.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">30.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">37.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">30.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">30.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">28.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HLE-Verified¹</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid 
rgba(128, 128, 128, 0.15)">38.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">48</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">37.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">37.6</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Reasoning</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">LiveCodeBench v6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HMMT Feb 25</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">99.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">97.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">98.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">95.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HMMT Nov 25</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">100</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">94.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">92.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">IMOAnswerBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.9</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">AIME26</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">96.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.3</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">General Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">BFCL-V4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">63.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">TAU2-Bench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">VITA-Bench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">51.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">40.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">41.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">49.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 
0.15);">DeepPlanning</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">44.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">33.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">23.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">28.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">14.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">34.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Tool Decathlon</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">36.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">18.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">27.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">38.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MCP-Mark</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">42.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">53.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">33.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">29.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">46.1</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Search Agent³</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">HLE w/ tool</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">49.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">50.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">48.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">BrowseComp</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">67.8</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">53.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--/74.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.0/78.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">BrowseComp-zh</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">60.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">--</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">70.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">WideSearch</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">74.0</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Seal-0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">47.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">45.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">46.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">46.9</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Multilingualism</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMMLU</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.0</td> <td 
style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.5</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MMLU-ProX</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.7</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">NOVA-63</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">56.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.1</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">INCLUDE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">82.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">83.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">85.6</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Global PIQA</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">90.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">91.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">93.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">86.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">89.8</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">PolyMATH</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.0</td> <td style="padding:7px 
12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">81.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">64.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">43.1</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">73.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">WMT24++</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">78.9</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">MAXIFE</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">79.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">87.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">84.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">88.2</td> </tr> <tr><td colspan="7" style="padding:8px 12px;font-weight:600;color:#7c3aed;border-bottom:1px solid rgba(124, 58, 237, 0.2);background:rgba(124, 58, 237, 0.1)">Coding Agent</td></tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SWE-bench Verified</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">80.9</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">75.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">76.4</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SWE-bench Multilingual</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">72.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">77.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">65.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">66.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 
0.15)">73.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">69.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">SecCodeBench</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.7</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.6</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">62.4</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">57.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">61.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">68.3</td> </tr> <tr> <td style="padding:7px 12px;padding-left:20px;border-bottom:1px solid rgba(128, 128, 128, 0.15);">Terminal Bench 2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.0</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">59.3</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">54.2</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">22.5</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">50.8</td> <td style="padding:7px 12px;text-align:center;border-bottom:1px solid rgba(128, 128, 128, 0.15)">52.5</td> </tr> </tbody> </table> <p style="margin-top:12px;font-size:11px;opacity:0.7"> * HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.<br> * TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.<br> * MCPMark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.<br> * Search Agent: most search agents built on our model adopt a simple context-folding strategy(256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.<br> * BrowseComp: we tested two strategies, simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.<br> * WideSearch: we use a 256k context window without any context management.<br> * MMLU-ProX: we report the averaged accuracy on 29 languages.<br> * WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.<br> * MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).<br> * Empty cells (--) indicate scores not yet available or not applicable.<br> </p> </div>

Vision Language

| | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| **STEM and Puzzle** | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| MathVista (mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench (EN-DEV-v1.1) | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv (RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| **Spatial Intelligence** | | | | | | |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO (avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
| **Video Understanding** | | | | | | |
| VideoMME (w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
| **Visual Agent** | | | | | | |
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
| **Medical VQA** | | | | | | |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |

* MathVision: our model's score is evaluated using a fixed prompt, e.g., "Please reason step by step, and put your final answer within \boxed{}." For other models, we report the higher score between runs with and without the \boxed{} formatting.
* BabyVision: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
* V*: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
* Empty cells (--) indicate scores not yet available or not applicable.

Quickstart

[!Important] Qwen3.5 models operate in thinking mode by default, generating thinking content signified by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples below.

For streamlined integration, we recommend using Qwen3.5 via APIs. Below is a guide to using Qwen3.5 via an OpenAI-compatible API.

Serving Qwen3.5

Qwen3.5 can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers for Qwen3.5 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

SGLang

SGLang is a fast serving framework for large language models and vision-language models. Qwen3.5 requires SGLang from the main branch of the open-source repository, which can be installed using the following command in a fresh environment:

uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens, using tensor parallelism across 8 GPUs.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To support tool use, you can use the following command.

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Qwen3.5 requires vLLM from the main branch of the open-source repository, which can be installed using the following command in a fresh environment:

uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

See its documentation for more details.

For detailed Qwen3.5 usage guide, see the vLLM Qwen3.5 recipe.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens, using tensor parallelism across 8 GPUs.

    vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
    
  • Tool Call: To support tool use, you can use the following command.

    vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

    vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
    

Using Qwen3.5 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure the SDK is installed and that the API key and API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

[!Tip] We recommend the following sampling parameters for generation:

  • Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that support for sampling parameters varies across inference frameworks.

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-FP8",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", response)

Instruct (or Non-Thinking) Mode

[!Important] Qwen3.5 does not officially support Qwen3's soft switch, i.e., /think and /no_think.

Qwen3.5 thinks by default before responding. You can obtain a direct response from the model, without thinking content, by configuring the API parameters. For example:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-FP8",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using APIs from Alibaba Cloud Model Studio, in addition to changing the model name, please use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}, as sketched below.
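For reference, a minimal sketch of that Model Studio variant (assumptions: the DashScope OpenAI-compatible endpoint shown later in this guide; the model name below is a placeholder to replace with the identifier listed in Model Studio):

import os
from openai import OpenAI

# Assumes Alibaba Cloud Model Studio's OpenAI-compatible (DashScope) endpoint.
client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)

chat_response = client.chat.completions.create(
    model="<model-studio-model-name>",  # placeholder; use the model name listed in Model Studio
    messages=[{"role": "user", "content": "Where is this?"}],
    extra_body={"enable_thinking": False},  # Model Studio expects this instead of chat_template_kwargs
)
print("Chat response:", chat_response)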

Agentic Usage

Qwen3.5 excels in tool calling capabilities.
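For illustration, a minimal sketch of tool calling through the OpenAI-compatible Chat Completions API (assumptions: a local server launched with the --tool-call-parser flag shown earlier; get_weather is a hypothetical tool defined only for this example):

from openai import OpenAI

client = OpenAI()  # uses OPENAI_BASE_URL / OPENAI_API_KEY from the environment

# A hypothetical tool definition; replace it with your own functions.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-FP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

# If the model decides to call a tool, the call appears here instead of plain text.
print(response.choices[0].message.tool_calls)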

Qwen-Agent

We recommend using Qwen-Agent to quickly build Agent applications with Qwen3.5.

To define the available tools, you can use an MCP configuration file, use the tools integrated in Qwen-Agent, or integrate other tools yourself.

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.5-397B-A17B-FP8',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OAI API, pass the parameter of whether to enable thinking mode in this way
        'extra_body': {
            'enable_thinking': True
        },
    },
}

# Using an OpenAI-compatible API endpoint. It is recommended to disable the reasoning and tool-call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.5-397B-A17B-FP8',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using vLLM/SGLang OAI API, pass the parameter of whether to enable thinking mode in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.5 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vLLM, and SGLang. In general, there are two approaches to enabling YaRN for supported frameworks:

  • Modifying the model configuration file: In the config.json file, change the rope_parameters fields in text_config to:

    {
        "mrope_interleaved": true,
        "mrope_section": [
            11,
            11,
            10
        ],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": 4.0,
        "original_max_position_embeddings": 262144
    }
    
  • Passing command line arguments:

    For vllm, you can use

    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000  
    

    For sglang, you can use

    SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
    

[!NOTE] All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise modifying the rope_parameters configuration only when processing long contexts is required. It is also recommended to adjust the factor as needed; for example, if the typical context length for your application is 524,288 tokens, it is better to set factor to 2.0.

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 for thinking mode and using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode.
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed (see the sketch after this list).

  5. Long Video Understanding: To optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is conservatively configured. It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example,

    {"longest_edge": 469762048, "shortest_edge": 4096}
    

    Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
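For frameworks that do not apply the provided Jinja2 chat template (best practice 4 above), here is a minimal sketch of stripping thinking content from assistant turns before re-sending the history. It assumes the thinking block is delimited by <think>...</think> as described in the Quickstart; the helper name is our own:

import re

def strip_thinking(text: str) -> str:
    # Remove the <think>...</think> block so only the final answer is kept in history.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

# Example: append only the final output to the conversation history.
assistant_reply = "<think>\nLet me reason...\n</think>\n\nThe answer is 42."
messages = [{"role": "user", "content": "What is 6 x 7?"}]
messages.append({"role": "assistant", "content": strip_thinking(assistant_reply)})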

Citation

If you find our work helpful, feel free to give us a cite.

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

Author: Qwen

Likes: 19

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, conversational, license:apache-2.0, endpoints_compatible, fp8, region:us

mmnga-o/NVIDIA-Nemotron-Nano-9B-v2-Japanese-gguf


library_name: transformers
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: text-generation
language:
  • en
  • ja
tags:
  • nvidia
base_model:
  • nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese
datasets:
  • TFMC/imatrix-dataset-for-japanese-llm
track_downloads: true

NVIDIA-Nemotron-Nano-9B-v2-Japanese-gguf

This is a GGUF-format conversion of NVIDIA-Nemotron-Nano-9B-v2-Japanese, published by NVIDIA.

The imatrix data was created using TFMC/imatrix-dataset-for-japanese-llm.

Usage

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# The Japanese prompt below means: "You are a professional chef. Tell me a recipe."
build/bin/llama-cli -m 'NVIDIA-Nemotron-Nano-9B-v2-Japanese-gguf' -n 128 -c 128 -p 'あなたはプロの料理人です。レシピを教えて'

Author: mmnga-o

Likes: 11

Downloads: 0

Tags: transformers, gguf, nvidia, text-generation, en, ja, dataset:TFMC/imatrix-dataset-for-japanese-llm, base_model:nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese, base_model:quantized:nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese, license:other, endpoints_compatible, region:us, imatrix, conversational

cerebras/MiniMax-M2.5-REAP-172B-A10B


language:
  • en
library_name: transformers
tags:
  • minimax
  • MOE
  • pruning
  • compression
license: other
name: cerebras/MiniMax-M2.5-REAP-172B-A10B
description: This model was obtained by uniformly pruning 25% of experts in MiniMax-M2.5 using the REAP method.
readme: https://huggingface.co/cerebras/MiniMax-M2.5-REAP-172B-A10B/main/README.md
pipeline_tag: text-generation
base_model:
  • MiniMaxAI/MiniMax-M2.5

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

MiniMax-M2.5-REAP-172B-A10B

✨ Highlights

Introducing MiniMax-M2.5-REAP-172B-A10B, a memory-efficient compressed variant of MiniMax-M2.5 that maintains near-identical performance while being 25% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:

  • Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 230B model
  • 25% Memory Reduction: Compressed from 230B to 172B parameters, significantly lowering deployment costs and memory requirements
  • Preserved Capabilities: Retains all core functionalities, including code generation, math & reasoning, and tool calling
  • Drop-in Compatibility: Works with vanilla vLLM - no source modifications or custom patches required
  • Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research

📋 Model Overview

MiniMax-M2.5-REAP-172B-A10B has the following specifications:

  • Base Model: MiniMax-M2.5
  • Compression Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression Ratio: 25% expert pruning
  • Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
  • Number of Parameters: 172B total, 10B activated per token
  • Number of Layers: 62
  • Number of Attention Heads: 48
  • Number of Experts: 192 (uniformly pruned from 256)
  • Number of Activated Experts: 8 per token
  • Context Length: 196,608 tokens
  • License: Modified MIT

📊 Evaluations

TBD


🚀 Deployment

You can deploy the model directly using the latest vLLM (with MiniMax-M2.5 support); no source modifications or custom patches are required.

vllm serve cerebras/MiniMax-M2.5-REAP-172B-A10B \
    --tensor-parallel-size 8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable_expert_parallel \
    --enable-auto-tool-choice

If you encounter insufficient memory when running this model, you might need to set a lower value for the --max-num-seqs flag (e.g., 64). For more information, refer to the official vLLM deployment guide.

🧩 Model Creation

This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of MiniMax-M2.5, with a 25% pruning rate.

How REAP Works

REAP selects experts to prune based on a novel saliency criterion that considers both:

  • Router gate values: How frequently and strongly the router activates each expert
  • Expert activation norms: The magnitude of each expert's output contributions

This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
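For intuition only, here is a rough sketch of such a saliency score. This is not the authors' implementation; the exact criterion, calibration data, and normalization are defined in the REAP paper cited below, and the function names are our own:

import torch

def reap_style_saliency(gate_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    # gate_probs: [num_tokens, num_experts] router gate values on calibration data
    # expert_out_norms: [num_tokens, num_experts] norm of each expert's output per token
    # Illustrative saliency: average router-weighted activation magnitude per expert.
    return (gate_probs * expert_out_norms).mean(dim=0)

def experts_to_prune(saliency: torch.Tensor, prune_ratio: float = 0.25) -> torch.Tensor:
    # Return indices of the lowest-saliency 25% of experts; the rest are kept as-is.
    num_prune = int(saliency.numel() * prune_ratio)
    return torch.argsort(saliency)[:num_prune]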

Key Advantages

  • One-Shot Compression: No fine-tuning required after pruning - the model is immediately ready for deployment
  • Preserved Router Control: Unlike expert merging methods, REAP maintains the router's independent, input-dependent control over remaining experts, avoiding "functional subspace collapse"
  • Generative Task Superiority: REAP significantly outperforms expert merging approaches on generative benchmarks (code generation, creative writing, mathematical reasoning) while maintaining competitive performance on discriminative tasks

📚 For more details, refer to the following resources:


⚖️ License

This model is derived from MiniMaxAI/MiniMax-M2.5 and distributed under the modified MIT license.


🧾 Citation

If you use this checkpoint, please cite the REAP paper:

@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Author: cerebras

Likes: 5

Downloads: 0

Tags: transformers, safetensors, minimax_m2, text-generation, minimax, MOE, pruning, compression, conversational, custom_code, en, arxiv:2510.13999, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:other, endpoints_compatible, fp8, region:us

nvidia/NVIDIA-Nemotron-Parse-v1.2


license: other

Model Overview

Description:

NVIDIA Nemotron Parse v1.2 is designed to understand document semantics and extract text and table elements with spatial grounding. Given an image, NVIDIA Nemotron Parse v1.2 produces structured annotations, including formatted text, bounding boxes, and the corresponding semantic classes, ordered according to the document's reading flow. It overcomes the shortcomings of traditional OCR technologies that struggle with complex document layouts and structural variability, and helps transform unstructured documents into actionable, machine-usable representations. This has several downstream benefits, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of extractor, curator, retriever, and AI agentic applications, and enhancing document understanding pipelines.

This model is ready for commercial use. <br>

Quick Start

Install dependencies in your environment

You can use a public image nvcr.io/nvidia/pytorch:25.03-py3 with the following library versions installed on top:

pip install accelerate==1.12.0
pip install albumentations==2.0.8
pip install transformers==4.51.3
pip install timm==1.0.22

Usage example

import torch
from PIL import Image, ImageDraw
from transformers import AutoModel, AutoProcessor, AutoTokenizer, AutoConfig, AutoImageProcessor, GenerationConfig
from postprocessing import extract_classes_bboxes, transform_bbox_to_original, postprocess_text

# Load model and processor
model_path = "nvidia/NVIDIA-Nemotron-Parse-v1.2"  # Or use a local path
device = "cuda:0"

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load image
image = Image.open("path/to/your/image.jpg")
task_prompt = "</s><s><predict_bbox><predict_classes><output_markdown><predict_no_text_in_pic>"
# task_prompt = "</s><s><predict_bbox><predict_classes><output_markdown><predict_text_in_pic>"

# Process image
inputs = processor(images=[image], text=task_prompt, return_tensors="pt", add_special_tokens=False).to(device)

generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
# Generate text
outputs = model.generate(**inputs,  generation_config=generation_config)

# Decode the generated text
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

Postprocessing

from PIL import Image, ImageDraw
from postprocessing import extract_classes_bboxes, transform_bbox_to_original, postprocess_text

classes, bboxes, texts = extract_classes_bboxes(generated_text)
bboxes = [transform_bbox_to_original(bbox, image.width, image.height) for bbox in bboxes]

# Specify output formats for postprocessing
table_format = 'latex' # latex | HTML | markdown
text_format = 'markdown' # markdown | plain
blank_text_in_figures = False # remove text inside 'Picture' class
texts = [postprocess_text(text, cls = cls, table_format=table_format, text_format=text_format, blank_text_in_figures=blank_text_in_figures) for text, cls in zip(texts, classes)]

for cl, bb, txt in zip(classes, bboxes, texts):
    print(cl, ': ', txt)

draw = ImageDraw.Draw(image)
for bbox in bboxes:
    draw.rectangle((bbox[0], bbox[1], bbox[2], bbox[3]), outline="red")

Inference with VLLM

Nemotron-Parse-v1.2 is available in vLLM main and ships in the vllm/vllm-openai:v0.14.1 Docker image.

Note: when running on A100/A10, we recommend running vllm serve with --attention-backend=TRITON_ATTN.

You will need to install the following dependencies on top, and then follow the VLLM Inference example below:

pip install albumentations timm open_clip_torch

VLLM Inference example

Option 1: end-to-end python inference

from vllm import LLM, SamplingParams
from PIL import Image

def main():
    sampling_params = SamplingParams(
        temperature=0,
        top_k=1,
        repetition_penalty=1.1,
        max_tokens=9000,
        skip_special_tokens=False,
    )
    
    llm = LLM(
        model="nvidia/NVIDIA-Nemotron-Parse-v1.2",
        max_num_seqs=64,
        limit_mm_per_prompt={"image": 1},
        dtype="bfloat16",
        trust_remote_code=True,
    )
    
    image = Image.open("<YOUR-IMAGE-PATH>")
    
    prompts = [
        {  # Implicit prompt
            "prompt": "</s><s><predict_bbox><predict_classes><output_markdown><predict_no_text_in_pic>",
            "multi_modal_data": {
                "image": image
            },
        },
        {  # Explicit encoder/decoder prompt
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {
                    "image": image
                },
            },
            "decoder_prompt": "</s><s><predict_bbox><predict_classes><output_markdown><predict_no_text_in_pic>",
        },
    ]
    
    outputs = llm.generate(prompts, sampling_params)
    
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Decoder prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()

Option 2: vllm serve

Alternatively, you can start a vllm server as:

vllm serve nvidia/NVIDIA-Nemotron-Parse-v1.2 \
    --dtype bfloat16 \
    --max-num-seqs 8 \
    --limit-mm-per-prompt '{"image": 1}' \
    --trust-remote-code \
    --port 8000 \
    --chat-template chat_template.jinja

with chat_template.jinja provided in this repository. Then, you can run inference as:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
)

# Read and base64-encode the image
with open("<your-image-path>", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown><predict_no_text_in_pic>"

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Parse-v1.2",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=9000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)
print(resp.choices[0].message.content)

Note: we recommend using the default prompt, which extracts bounding boxes, classes, and text in markdown formatting, for all use cases (</s><s><predict_bbox><predict_classes><output_markdown><predict_no_text_in_pic> or </s><s><predict_bbox><predict_classes><output_markdown><predict_text_in_pic>). If needed, a prompt that omits text extraction and outputs only bounding boxes and classes can also be used: </s><s><predict_bbox><predict_classes><output_no_text><predict_no_text_in_pic>.

Logits processors

With Nemotron-Parse-v1.2 we share two logits processors, available in the logitsprocessors/ directory for vLLM and in hf_logits_processor.py for the Python model:

  • NemotronParseRepetitionStopProcessor - detects repeating n-grams during generation and forces the model to close the <x_><y_> block when a potential hallucination is detected.
  • NemotronParseTableInsertionLogitsProcessor - forces every block to follow a table structure (useful if, e.g., you are running the model on table image crops).

Please refer to example_with_processor.py for example usage with the Python model. With vLLM, you can provide these as arguments to vllm serve, after exporting logitsprocs/ to PYTHONPATH, e.g.:

vllm serve nvidia/NVIDIA-Nemotron-Parse-v1.2 \
  --dtype bfloat16 \
  --max-num-seqs 4 \
  --limit-mm-per-prompt '{"image": 1}' \
  --attention-backend=TRITON_ATTN \
  --trust-remote-code \
  --logits-processors nemotron_parse_vllm_logitprocs:NemotronParseTableInsertionLogitsProcessor \
  --port 8000

An example of inference with vllm openai server is available in vllm_example.py
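
For the Hugging Face path, the wiring might look like the minimal sketch below, reusing model, inputs, tokenizer, generation_config, and processor from the Usage example above. The constructor arguments are an assumption; see example_with_processor.py for the actual usage.

from transformers import LogitsProcessorList
from hf_logits_processor import NemotronParseTableInsertionLogitsProcessor  # shipped in this repository

# Constructor arguments are assumed here; check example_with_processor.py for the real signature.
table_processor = NemotronParseTableInsertionLogitsProcessor(tokenizer)

outputs = model.generate(
    **inputs,
    generation_config=generation_config,
    logits_processor=LogitsProcessorList([table_processor]),
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]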

License/Terms of Use

Governing Terms: Your use of this model is governed by the NVIDIA Nemotron Open Model License. Use of the tokenizer included in this model is governed by the CC-BY-4.0 license.

Deployment Geography:

Global

Use Case:

NVIDIA Nemotron Parse v1.2 will be capable of comprehensive text understanding and document structure understanding. It will be used in retriever and curator solutions. Its text extraction datasets and capabilities will help with LLM and VLM training, as well as improve run-time inference accuracy of VLMs. The NVIDIA Nemotron Parse v1.2 model will perform text extraction from PDF and PPT documents. It can classify the objects (title, section, caption, index, footnote, lists, tables, bibliography, image) in a given document, and provide bounding boxes with coordinates.

Release Date:

Hugging Face [02/17/2026] via [URL]

Reference(s):

  • https://huggingface.co/docs/transformers/en/model_doc/mbart

Model Architecture:

Architecture Type: Transformer-based vision-encoder-decoder model

Network Architecture:

  • Vision Encoder: ViT-H model (https://huggingface.co/nvidia/C-RADIO)
  • Adapter Layer: 1D convolutions & norms to compress the dimensionality and sequence length of the latent space (1280 tokens to 320 tokens); see the sketch after this list
  • Decoder: mBart [1], 10 blocks
  • Tokenizer: Use of the tokenizer included in this model is governed by the CC-BY-4.0 license
  • Number of Parameters: < 1B
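
To make the adapter's token compression concrete, here is a hypothetical sketch (the layer shapes and the decoder width of 1024 are assumptions for illustration only; the actual adapter in the released weights may differ):

import torch
from torch import nn

class AdapterSketch(nn.Module):
    """Illustrative adapter: compresses the vision encoder's sequence from
    1280 tokens to 320 tokens (factor 4) and projects it to the decoder width."""
    def __init__(self, enc_dim=1280, dec_dim=1024):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, dec_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(dec_dim)

    def forward(self, x):                                  # x: (batch, 1280, enc_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # -> (batch, 320, dec_dim)
        return self.norm(x)

print(AdapterSketch()(torch.randn(1, 1280, 1280)).shape)   # torch.Size([1, 320, 1024])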

Input:

  • Input Type: Image, Text
  • Input Type(s): Red, Green, Blue (RGB) + Prompt (String)
  • Input Parameters: Two-Dimensional (2D), One-Dimensional (1D)
  • Other Properties Related to Input:
    • Max Input Resolution (Width, Height): 1664, 2048
    • Min Input Resolution (Width, Height): 1024, 1280
  • Channel Count: 3

Output:

  • Output Type: Text
  • Output Format: String
  • Output Parameters: One-Dimensional (1D)
  • Other Properties Related to Output: Nemotron-Parse output format is a string which encodes text content (formatted or not) as well as bounding boxes and class attributes.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TensorRT-LLM
  • vLLM

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Turing

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

Nemotron Parse 1.2

Training, Testing, and Evaluation Datasets:

Training Dataset

Image Training Data Size:

  • 1 Million to 1 Billion Images

Text Training Data Size:

  • 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Labeling Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)): The training set contains millions of image-text items, aggregated across many large document and table datasets totaling several terabytes of data. The data consists of document-page and table images paired with OCR text, bounding boxes, and layout labels, drawn from real-world sources (scientific papers, PDFs, Wikipedia pages) as well as fully synthetic tables and word/character renderings. Modalities are primarily images plus associated text and structural annotations; content spans public-domain resources and synthetic data. Images are obtained by rendering digital documents or generating synthetic layouts, and annotations come from OCR/layout models, third-party OCR services, and human labeling.

Inference:

Acceleration Engine: TensorRT-LLM, vLLM

Test Hardware:

  • H100
  • A100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Author: nvidia

Likes: 4

Downloads: 0

Tags: safetensors, nemotron_parse, custom_code, license:other, region:us

cerebras/MiniMax-M2.5-REAP-139B-A10B


language:

  • en

library_name: transformers

tags:

  • minimax
  • MOE
  • pruning
  • compression

license: other
name: cerebras/MiniMax-M2.5-REAP-139B-A10B
description: This model was obtained by uniformly pruning 40% of experts in MiniMax-M2.5 using the REAP method.
readme: https://huggingface.co/cerebras/MiniMax-M2.5-REAP-139B-A10B/main/README.md
pipeline_tag: text-generation

base_model:

  • MiniMaxAI/MiniMax-M2.5

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

MiniMax-M2.5-REAP-139B-A10B

✨ Highlights

Introducing MiniMax-M2.5-REAP-139B-A10B, a memory-efficient compressed variant of MiniMax-M2.5 that maintains near-identical performance while being 40% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:

  • Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 230B model
  • 40% Memory Reduction: Compressed from 230B to 139B parameters, significantly lowering deployment costs and memory requirements
  • Preserved Capabilities: Retains all core functionalities including code generation, math & reasoning and tool calling.
  • Drop-in Compatibility: Works with vanilla vLLM - no source modifications or custom patches required
  • Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research

📋 Model Overview

MiniMax-M2.5-REAP-139B-A10B has the following specifications:

  • Base Model: MiniMax-M2.5
  • Compression Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression Ratio: 40% expert pruning
  • Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
  • Number of Parameters: 139B total, 10B activated per token
  • Number of Layers: 62
  • Number of Attention Heads: 48
  • Number of Experts: 154 (uniformly pruned from 256)
  • Number of Activated Experts: 8 per token
  • Context Length: 196,608 tokens
  • License: Modified MIT

📊 Evaluations

TBD


🚀 Deployment

You can deploy the model directly using the latest vLLM (with MiniMax-M2.5 support); no source modifications or custom patches are required.

vllm serve cerebras/MiniMax-M2.5-REAP-139B-A10B \
    --tensor-parallel-size 8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable_expert_parallel \
    --enable-auto-tool-choice

If you encounter insufficient memory when running this model, you might need to set a lower value for the --max-num-seqs flag (e.g., 64). For more information, refer to the official vLLM deployment guide.

🧩 Model Creation

This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method uniformly across all Mixture-of-Experts (MoE) blocks of MiniMax-M2.5, with a 40% pruning rate.

How REAP Works

REAP selects experts to prune based on a novel saliency criterion that considers both:

  • Router gate values: How frequently and strongly the router activates each expert
  • Expert activation norms: The magnitude of each expert's output contributions

This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.

Key Advantages

  • One-Shot Compression: No fine-tuning required after pruning - the model is immediately ready for deployment
  • Preserved Router Control: Unlike expert merging methods, REAP maintains the router's independent, input-dependent control over remaining experts, avoiding "functional subspace collapse"
  • Generative Task Superiority: REAP significantly outperforms expert merging approaches on generative benchmarks (code generation, creative writing, mathematical reasoning) while maintaining competitive performance on discriminative tasks

📚 For more details, refer to the following resources:


⚖️ License

This model is derived from MiniMaxAI/MiniMax-M2.5 and distributed under the modified MIT license.


🧾 Citation

If you use this checkpoint, please cite the REAP paper:

@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Author: cerebras

Likes: 3

Downloads: 0

Tags: transformers, safetensors, minimax_m2, text-generation, minimax, MOE, pruning, compression, conversational, custom_code, en, arxiv:2510.13999, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:other, endpoints_compatible, fp8, region:us

BiliSakura/BitDance-14B-64x-diffusers


license: apache-2.0 library_name: diffusers pipeline_tag: text-to-image base_model: shallowdream204/BitDance-14B-64x language:

  • en tags:
  • bitdance
  • text-to-image
  • custom-pipeline
  • diffusers
  • qwen

BitDance-14B-64x (Diffusers)

Diffusers-converted checkpoint for BitDance-14B-64x with bundled custom pipeline code (bitdance_diffusers) for direct loading with DiffusionPipeline.

Quickstart (native diffusers)

import torch
from diffusers import DiffusionPipeline

# Local path (recommended - no trust_remote_code needed)
model_path = "BiliSakura/BitDance-14B-64x-diffusers"
pipe = DiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline=model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

result = pipe(
    prompt="A close-up portrait in a cinematic photography style, capturing a girl-next-door look on a sunny daytime urban street. She wears a khaki sweater, with long, flowing hair gently draped over her shoulders. Her head is turned slightly, revealing soft facial features illuminated by realistic, delicate sunlight coming from the left. The sunlight subtly highlights individual strands of her hair. The image has a Canon film-like color tone, evoking a warm nostalgic atmosphere.",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=7.5,
)
result.images[0].save("bitdance_14b_64x.png")

Test Running

Run tests from the model directory in your active Python environment:

python test_bitdance.py

VRAM Usage by Resolution

Measured on NVIDIA A100-SXM4-80GB using:

  • dtype=torch.bfloat16
  • num_inference_steps=30
  • guidance_scale=7.5
  • prompt: A close-up portrait in a cinematic photography style, capturing a girl-next-door look on a sunny daytime urban street. She wears a khaki sweater, with long, flowing hair gently draped over her shoulders. Her head is turned slightly, revealing soft facial features illuminated by realistic, delicate sunlight coming from the left. The sunlight subtly highlights individual strands of her hair. The image has a Canon film-like color tone, evoking a warm nostalgic atmosphere.

| Resolution | Peak Allocated VRAM (GiB) | Peak Reserved VRAM (GiB) | Time (s) | Status |
| --- | ---: | ---: | ---: | --- |
| 512x512 | 39.60 | 40.62 | 4.08 | ok |
| 1024x1024 | 41.21 | 50.15 | 15.79 | ok |
| 1280x768 | 40.88 | 49.52 | 14.78 | ok |
| 768x1280 | 40.88 | 49.52 | 14.75 | ok |
| 1536x640 | 40.88 | 49.52 | 14.76 | ok |
| 2048x512 | 41.21 | 50.15 | 15.85 | ok |

Model Metadata

  • Pipeline class: BitDanceDiffusionPipeline
  • Diffusers version in config: 0.36.0
  • Parallel prediction factor: 64
  • Text stack: Qwen3ForCausalLM + Qwen2TokenizerFast
  • Supported resolutions include 1024x1024, 1280x768, 768x1280, 2048x512, and more (see model_index.json)

Citation

If you use this model, please cite BitDance and Diffusers:

@article{ai2026bitdance,
  title   = {BitDance: Scaling Autoregressive Generative Models with Binary Tokens},
  author  = {Ai, Yuang and Han, Jiaming and Zhuang, Shaobin and Hu, Xuefeng and Yang, Ziyan and Yang, Zhenheng and Huang, Huaibo and Yue, Xiangyu and Chen, Hao},
  journal = {arXiv preprint arXiv:2602.14041},
  year    = {2026}
}

@inproceedings{von-platen-etal-2022-diffusers,
  title     = {Diffusers: State-of-the-art diffusion models},
  author    = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  booktitle = {GitHub repository},
  year      = {2022},
  url       = {https://github.com/huggingface/diffusers}
}

License

This repository is distributed under the Apache-2.0 license, consistent with the upstream BitDance release.

Author: BiliSakura

Likes: 3

Downloads: 10

Tags: diffusers, safetensors, bitdance, text-to-image, custom-pipeline, qwen, en, arxiv:2602.14041, base_model:shallowdream204/BitDance-14B-64x, base_model:finetune:shallowdream204/BitDance-14B-64x, license:apache-2.0, diffusers:BitDanceDiffusionPipeline, region:us

BiliSakura/BitDance-14B-16x-diffusers


license: apache-2.0 library_name: diffusers pipeline_tag: text-to-image base_model: shallowdream204/BitDance-14B-16x language:

  • en tags:
  • bitdance
  • text-to-image
  • custom-pipeline
  • diffusers
  • qwen

BitDance-14B-16x (Diffusers)

Diffusers-converted checkpoint for BitDance-14B-16x with bundled custom pipeline code (bitdance_diffusers) so it can be loaded directly with DiffusionPipeline.

Quickstart (native diffusers)

import torch
from diffusers import DiffusionPipeline

# Local path (recommended - no trust_remote_code needed)
model_path = "BiliSakura/BitDance-14B-16x-diffusers"
pipe = DiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline=model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

result = pipe(
    prompt="A close-up portrait in a cinematic photography style, capturing a girl-next-door look on a sunny daytime urban street. She wears a khaki sweater, with long, flowing hair gently draped over her shoulders. Her head is turned slightly, revealing soft facial features illuminated by realistic, delicate sunlight coming from the left. The sunlight subtly highlights individual strands of her hair. The image has a Canon film-like color tone, evoking a warm nostalgic atmosphere.",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=7.5,
    show_progress_bar=True,
)
result.images[0].save("bitdance_14b_16x.png")

Test Running

Run tests from the model directory in your active Python environment:

python test_bitdance.py

VRAM Usage by Resolution

Measured on NVIDIA A100-SXM4-80GB using:

  • dtype=torch.bfloat16
  • num_inference_steps=30
  • guidance_scale=7.5
  • prompt: A cinematic landscape photo of snowy mountains at sunrise.

| Resolution | Peak Allocated VRAM (GiB) | Peak Reserved VRAM (GiB) | Time (s) | Status |
| --- | ---: | ---: | ---: | --- |
| 512x512 | 32.67 | 33.47 | 13.71 | ok |
| 1024x1024 | 35.51 | 38.76 | 54.47 | ok |
| 1280x768 | 35.28 | 38.34 | 50.97 | ok |
| 768x1280 | 35.28 | 38.34 | 51.22 | ok |
| 1536x640 | 35.28 | 38.34 | 51.29 | ok |
| 2048x512 | 35.51 | 38.76 | 54.61 | ok |

Model Metadata

  • Pipeline class: BitDanceDiffusionPipeline
  • Diffusers version in config: 0.36.0
  • Parallel prediction factor: 16
  • Text stack: Qwen3ForCausalLM + Qwen2TokenizerFast
  • Supported resolutions include 1024x1024, 1280x768, 768x1280, 2048x512, and more (see model_index.json)

Citation

If you use this model, please cite BitDance and Diffusers:

@article{ai2026bitdance,
  title   = {BitDance: Scaling Autoregressive Generative Models with Binary Tokens},
  author  = {Ai, Yuang and Han, Jiaming and Zhuang, Shaobin and Hu, Xuefeng and Yang, Ziyan and Yang, Zhenheng and Huang, Huaibo and Yue, Xiangyu and Chen, Hao},
  journal = {arXiv preprint arXiv:2602.14041},
  year    = {2026}
}

@inproceedings{von-platen-etal-2022-diffusers,
  title     = {Diffusers: State-of-the-art diffusion models},
  author    = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  booktitle = {GitHub repository},
  year      = {2022},
  url       = {https://github.com/huggingface/diffusers}
}

License

This repository is distributed under the Apache-2.0 license, consistent with the upstream BitDance release.

Author: BiliSakura

Likes: 3

Downloads: 100

Tags: diffusers, safetensors, bitdance, text-to-image, custom-pipeline, qwen, en, arxiv:2602.14041, base_model:shallowdream204/BitDance-14B-16x, base_model:finetune:shallowdream204/BitDance-14B-16x, license:apache-2.0, diffusers:BitDanceDiffusionPipeline, region:us

ShaileshH/smol-workertech


library_name: transformers tags:

  • smollm2
  • automotive
  • question-answering
  • instruction-tuning
  • domain-adaptation
  • workshop-assistant

Model Card for SmolLM2-135M-Technician-QA

This model is a domain-adapted version of HuggingFaceTB/SmolLM2-135M, fine-tuned to answer questions related to automotive service, technician workflows, diagnostics, and spare part replacement scenarios.

It is optimized for lightweight deployment in workshop assistants, service center copilots, and edge devices.


Model Details

Model Description

SmolLM2-135M-Technician-QA is a compact instruction-following language model fine-tuned on a curated dataset of technician question-answer pairs covering:

  • Customer vehicle issues
  • Technical diagnostics
  • Work order lifecycle
  • Periodic service procedures
  • Spare part replacement decisions
  • On-site breakdown support

The model is designed for real-world automotive service environments where fast and efficient inference is required.

  • Developed by: Shailesh H
  • Funded by: Self / Research & Development
  • Shared by: Shailesh H
  • Model type: Causal Language Model (Instruction-tuned)
  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: HuggingFaceTB/SmolLM2-135M

Model Sources

  • Repository: https://huggingface.co/<your-username>/SmolLM2-135M-Technician-QA
  • Base Model: https://huggingface.co/HuggingFaceTB/SmolLM2-135M

Uses

Direct Use

This model can be used for:

  • Automotive technician assistants
  • Workshop chatbot systems
  • Service advisor support
  • Troubleshooting guidance
  • Training simulators for technicians

Downstream Use

The model can be integrated into:

  • RAG systems with service manuals
  • Mobile workshop applications
  • Edge diagnostic tools
  • Voice-based service assistants

Out-of-Scope Use

This model should NOT be used for:

  • Safety-critical vehicle control
  • Legal or compliance decisions
  • Autonomous driving systems
  • Financial or medical advice

Bias, Risks, and Limitations

  • Trained on synthetic domain data → may not cover all vehicle models
  • Limited general world knowledge due to small model size
  • May generate plausible but incorrect repair steps
  • English-only responses

Recommendations

  • Always verify outputs with OEM service manuals
  • Use as an assistive tool, not a final authority
  • Combine with RAG for production deployment

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "<your-username>/SmolLM2-135M-Technician-QA"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Customer says the car battery drains overnight. What should you check?"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=120)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out automotive technician QA samples from the same domain.

Factors

  • Customer complaint handling
  • Diagnostic reasoning
  • Spare part replacement logic
  • Service workflow understanding

Metrics

  • Perplexity
  • Instruction-following accuracy
  • Manual domain evaluation

Results

  • Strong performance on workshop troubleshooting queries
  • Accurate step-by-step diagnostic suggestions
  • Fast inference on CPU

Summary

The fine-tuned model shows clear domain adaptation compared to the base SmolLM2 model, especially for automotive service workflows.

Author: ShaileshH

Likes: 2

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, smollm2, automotive, question-answering, instruction-tuning, domain-adaptation, workshop-assistant, text-generation-inference, endpoints_compatible, region:us

lixiaoxi45/OmniAtlas-Qwen2.5-7B


license: apache-2.0

Author: lixiaoxi45

Likes: 2

Downloads: 0

Tags: safetensors, qwen2_5_omni, license:apache-2.0, region:us

mradermacher/Nanbeige4.1-3B-heretic-i1-GGUF


base_model: heretic-org/Nanbeige4.1-3B-heretic language:

  • en
  • zh library_name: transformers license: apache-2.0 mradermacher: readme_rev: 1 quantized_by: mradermacher tags:
  • llm
  • nanbeige
  • heretic
  • uncensored
  • decensored
  • abliterated

About


weighted/imatrix quants of https://huggingface.co/heretic-org/Nanbeige4.1-3B-heretic


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/Nanbeige4.1-3B-heretic-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.
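
If you have llama-cpp-python installed, a minimal loading sketch looks like the following. The filename below is an assumption based on the usual naming of these quants; check the repository's file list for the exact file you want.

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one quant from this repo (the filename is a guess; verify it in the file list).
gguf_path = hf_hub_download(
    repo_id="mradermacher/Nanbeige4.1-3B-heretic-i1-GGUF",
    filename="Nanbeige4.1-3B-heretic.i1-Q4_K_M.gguf",
)

llm = Llama(model_path=gguf_path, n_ctx=4096)
out = llm("Explain what an imatrix quant is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])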

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 1.2 | for the desperate |
| GGUF | i1-IQ1_M | 1.2 | mostly desperate |
| GGUF | i1-IQ2_XXS | 1.4 | |
| GGUF | i1-IQ2_XS | 1.4 | |
| GGUF | i1-IQ2_S | 1.5 | |
| GGUF | i1-IQ2_M | 1.6 | |
| GGUF | i1-Q2_K_S | 1.6 | very low quality |
| GGUF | i1-Q2_K | 1.7 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 1.8 | lower quality |
| GGUF | i1-IQ3_XS | 1.9 | |
| GGUF | i1-Q3_K_S | 2.0 | IQ3_XS probably better |
| GGUF | i1-IQ3_S | 2.0 | beats Q3_K* |
| GGUF | i1-IQ3_M | 2.0 | |
| GGUF | i1-Q3_K_M | 2.1 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 2.2 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 2.3 | |
| GGUF | i1-IQ4_NL | 2.4 | prefer IQ4_XS |
| GGUF | i1-Q4_0 | 2.4 | fast, low quality |
| GGUF | i1-Q4_K_S | 2.4 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 2.5 | fast, recommended |
| GGUF | i1-Q4_1 | 2.6 | |
| GGUF | i1-Q5_K_S | 2.9 | |
| GGUF | i1-Q5_K_M | 2.9 | |
| GGUF | i1-Q6_K | 3.3 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

[image: quant-type comparison graph]

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, llm, nanbeige, heretic, uncensored, decensored, abliterated, en, zh, base_model:heretic-org/Nanbeige4.1-3B-heretic, base_model:quantized:heretic-org/Nanbeige4.1-3B-heretic, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational