Today's AI Summary

Qwen3-Next Models Push Boundaries in Long-Context and Reasoning, ToolOrchestra Enhances Intelligence

Today's AI landscape sees significant advancements in both model architecture and multi-agent systems. The Qwen3-Next models are making waves with their innovative approach to long-context handling and reasoning, while ToolOrchestra introduces a novel method for efficient model and tool orchestration.

Noteworthy Research Papers

  • Revisiting Generalization Across Difficulty Levels: It's Not So Easy: This paper challenges the assumption that training on either easy or hard data consistently improves LLM performance across all difficulty levels. The study emphasizes the importance of diverse difficulty levels in both training and evaluation datasets.
  • ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration: This research introduces ToolOrchestra, a method for training small orchestrator models that coordinate intelligent tools. An 8B model, Orchestrator, achieves higher accuracy and efficiency on complex agentic tasks compared to larger models like GPT-5. The model uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards.
  • G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning: This paper presents G²VLM, a vision-language model that integrates 3D reconstruction and spatial understanding. By leveraging learned 3D visual geometry features, G²VLM enhances spatial reasoning tasks and achieves competitive results in both 3D reconstruction and spatial understanding.
  • Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework: Matrix is a decentralized framework for multi-agent synthetic data generation. Its peer-to-peer design eliminates the need for a central orchestrator, enabling scalable and adaptable data generation workflows with improved throughput.
  • Agentic Learner with Grow-and-Refine Multimodal Semantic Memory: This research introduces ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory for multimodal learning. By separately encoding visual distraction patterns and logical reasoning errors, ViLoMem improves accuracy and reduces repeated errors in multimodal benchmarks.
  • Through the telecom lens: Are all training samples important?: This paper questions the assumption that all training samples contribute equally, analyzing the roles of individual samples in telecom model training and assessing whether the proposed approach can optimize computation and energy use.
  • Escaping the Verifier: Learning to Reason via Demonstrations: This paper introduces RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning.
  • Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models: This paper proposes ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space.
  • Continual Error Correction on Low-Resource Devices: This paper presents a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage.
  • Bridging the Unavoidable A Priori: A Framework for Comparative Causal Modeling: This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.

New Models on the Scene

  • unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF (29 likes): This model, part of the Qwen3-Next series, features hybrid attention, high-sparsity Mixture-of-Experts (MoE), stability optimizations, and Multi-Token Prediction (MTP). It supports ultra-long context lengths up to 256K tokens and excels in various benchmarks, demonstrating strong performance in knowledge, reasoning, coding, alignment, and agentic tasks.
  • unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF (14 likes): Another installment in the Qwen3-Next series, this model focuses on "thinking" mode, leveraging GSPO to enhance stability and efficiency. It outperforms previous Qwen3 models and even proprietary models like Gemini-2.5-Flash-Thinking on complex reasoning tasks.
  • bdsqlsz/qinglong_DetailedEyes_Z-Image (10 likes): A LoRA model for text-to-image generation, based on Tongyi-MAI/Z-Image-Turbo.
  • YanLabs/gemma3-27b-it-abliterated-normpreserve (9 likes): An abliterated version of google/gemma-3-27b-it.

AI Papers for 2026-02-16

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
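The verification recipe is simple to sketch: sample several rephrasings of the instruction, sample several action candidates per rephrasing, and let a verifier pick the highest-scoring pair. The snippet below is a minimal illustration of that selection loop only; rephrase, policy, and verifier are hypothetical stand-ins, not the paper's actual interfaces.

# Minimal sketch of verifier-based selection in the spirit of CoVer.
# `rephrase`, `policy`, and `verifier` are hypothetical stand-ins, not the paper's API.
def select_action(instruction, observation, rephrase, policy, verifier,
                  n_rephrases=4, n_actions=8):
    # "Boot-time compute": rephrasings can be precomputed once per task.
    prompts = [instruction] + [rephrase(instruction) for _ in range(n_rephrases - 1)]

    best_score, best_action = float("-inf"), None
    for prompt in prompts:
        for _ in range(n_actions):
            action = policy.sample_action(prompt, observation)   # candidate action chunk
            score = verifier.score(prompt, observation, action)  # alignment score
            if score > best_score:
                best_score, best_action = score, action
    return best_action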

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

AttentionRetriever: Attention Layers are Secretly Long Document Retrievers

Retrieval-augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long document retrieval and fail to address several of its key challenges, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. In extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.

Agentic Test-Time Scaling for WebAgents

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
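The uncertainty signals CATTS relies on are cheap to compute from the vote distribution itself. Below is a minimal Python sketch of vote entropy and the top-1/top-2 margin, plus a contentiousness check; the thresholds and sample counts are illustrative, not the paper's values.

import math
from collections import Counter

def vote_uncertainty(votes):
    # Entropy and top-1/top-2 margin of the agent's own vote distribution.
    counts = Counter(votes)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return entropy, margin

def decide(sample_step, k_base=3, k_extra=6, margin_threshold=0.5):
    # Sample a few candidate decisions; escalate only when the vote is contentious.
    votes = [sample_step() for _ in range(k_base)]
    _, margin = vote_uncertainty(votes)
    if margin < margin_threshold:          # contentious step: allocate extra compute
        votes += [sample_step() for _ in range(k_extra)]
    return Counter(votes).most_common(1)[0][0]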

Creative Ownership in the Age of AI

Copyright law focuses on whether a new work is "substantially similar" to an existing one, but generative AI can closely imitate style without copying content, a capability now central to ongoing litigation. We argue that existing definitions of infringement are ill-suited to this setting and propose a new criterion: a generative AI output infringes on an existing work if it could not have been generated without that work in its training corpus. To operationalize this definition, we model generative systems as closure operators mapping a corpus of existing works to an output of new works. AI generated outputs are \emph{permissible} if they do not infringe on any existing work according to our criterion. Our results characterize structural properties of permissible generation and reveal a sharp asymptotic dichotomy: when the process of organic creations is light-tailed, dependence on individual works eventually vanishes, so that regulation imposes no limits on AI generation; with heavy-tailed creations, regulation can be persistently constraining.

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
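Schematically, a checklist reward replaces a single verifiable outcome with the fraction of fine-grained binary criteria a judge marks as satisfied for the turn. The sketch below shows only that aggregation; CM2's evidence grounding, structured metadata, and sparse-assignment strategy are not reproduced here.

def checklist_reward(criteria_results):
    # criteria_results: list of dicts such as
    #   {"criterion": "agent confirmed the order id before refunding", "passed": True}
    # Dense per-criterion judgments are collapsed into one scalar reward for the turn.
    if not criteria_results:
        return 0.0
    passed = sum(1 for c in criteria_results if c["passed"])
    return passed / len(criteria_results)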

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
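As a rough illustration of the "physics priors configure the regressor" step, the sketch below restricts a PySR function library to operators consistent with an inferred prior (here, a damped oscillation). It is not KeplerAgent's code, and the toy data and operator choices are illustrative assumptions.

import numpy as np
from pysr import PySRRegressor

# Toy signal resembling a damped oscillator; in KeplerAgent, physics tools would
# first infer such structure and then narrow the symbolic-regression search space.
t = np.linspace(0, 10, 500).reshape(-1, 1)
y = np.exp(-0.3 * t[:, 0]) * np.cos(2.0 * t[:, 0])

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*"],
    unary_operators=["cos", "exp"],   # library restricted by the inferred prior
    maxsize=20,
)
model.fit(t, y)
print(model.sympy())                  # best symbolic expression found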

On the implicit regularization of Langevin dynamics with projected noise

We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new light on the effects of symmetry on stochastic gradient descent for over-parametrized models. Our main result identifies a novel form of implicit regularization: when the initial and target density are both invariant under the group action, Langevin dynamics with projected noise is equivalent in law to Langevin dynamics with isotropic diffusion but with an additional drift term proportional to the negative log volume of the group orbit. We prove this result by constructing a coupling of the two processes via a third process on the group itself, and identify the additional drift as the mean curvature of the orbits.

A technical curriculum on language-oriented artificial intelligence in translation and specialised communication

This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.

AI Models

rednote-hilab/dots.ocr-1.5


<div align="center"> <p align="center"> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/logo.png" width="300"/> <p> <h1 align="center"> dots.ocr-1.5: Recognize Any Human Scripts and Symbols </h1>

HuggingFace GitHub

<div align="center"> <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> </div> </div>

Introduction

We present dots.ocr-1.5, a 3B-parameter multimodal model composed of a 1.2B vision encoder and a 1.7B language model. Designed for universal accessibility, it possesses the capability to recognize virtually any human script. Beyond achieving state-of-the-art (SOTA) performance in standard multilingual document parsing among models of comparable size, dots.ocr-1.5 excels at converting structured graphics (e.g., charts and diagrams) directly into SVG code, parsing web screens and spotting scene text. Furthermore, the model demonstrates competitive performance in general OCR, object grounding & counting tasks.

  1. Stronger Document Parsing Performance: dots.ocr-1.5 maintains SOTA performance among the latest OCR models, particularly on multilingual documents. To address the significant bias inherent in the detection & matching rules of certain benchmarks, which often fail to accurately reflect a model's true capabilities, we adopted an Elo score evaluation system (a short sketch of the standard Elo update follows this list). Under this metric, the performance landscape shifts significantly, highlighting the superior robustness of our model compared to conventional rankings.
  2. Unified Vision-Language Parsing: Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge, akin to natural language. dots.ocr-1.5 unifies the interpretation of these elements by parsing them directly into SVG code. We have validated the effectiveness of this approach, demonstrating impressive results in structural and semantic recognition.
  3. Broader and More General Capabilities: Compared to dots.ocr, dots.ocr-1.5 supports a significantly wider array of tasks. It extends beyond standard OCR to handle web screen parsing, scene text spotting, object grounding & counting, and other general OCR QA tasks.
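For context on the Elo-based ranking mentioned in item 1 above, the snippet below is the standard pairwise Elo update. The K-factor, pairing scheme, and judge prompt used for the tables in this card are not specified here, so the values are illustrative only.

def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a is 1.0 if model A's parse is judged better, 0.5 for a tie, 0.0 otherwise.
    # The K-factor and pairing scheme are illustrative, not this card's exact settings.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b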

Evaluation

1. Document Parsing

1.1 Elo scores on different benchmarks for the latest models

<table> <thead> <tr> <th>models</th> <th>olmOCR-Bench</th> <th>OmniDocBench (v1.5)</th> <th>XDocParse</th> </tr> </thead> <tbody> <tr> <td>GLM-OCR</td> <td>859.9</td> <td>937.5</td> <td>742.1</td> </tr> <tr> <td>PaddleOCR-VL-1.5</td> <td>873.6</td> <td>965.6</td> <td>797.6</td> </tr> <tr> <td>HuanyuanOCR</td> <td>978.9</td> <td>974.4</td> <td>895.9</td> </tr> <tr> <td>dots.ocr</td> <td>1027.4</td> <td>994.7</td> <td>1133.4</td> </tr> <!-- Highlighting dots.ocr-1.5 row with bold tags --> <tr> <td><strong>dots.ocr-1.5</strong></td> <td><strong>1089.0</strong></td> <td><strong>1025.8</strong></td> <td><strong>1157.1</strong></td> </tr> <tr> <td>Gemini 3 Pro</td> <td>1171.2</td> <td>1102.1</td> <td>1273.9</td> </tr> </tbody> </table>

Notes:

  • Results for Gemini 3 Pro, PaddleOCR-VL-1.5, and GLM-OCR were obtained via APIs, while HunyuanOCR results were generated using local inference.
  • The Elo score evaluation was conducted using Gemini 3 Flash. The prompt can be found at: Elo Score Prompt. These results are consistent with the findings on ocrarena.

1.2 olmOCR-bench

<table> <thead> <tr> <th>Model</th> <th>ArXiv</th> <th>Old scans math</th> <th>Tables</th> <th>Old scans</th> <th>Headers & footers</th> <th>Multi column</th> <th>Long tiny text</th> <th>Base</th> <th>Overall</th> </tr> </thead> <tbody> <tr> <td>Mistral OCR API</td> <td>77.2</td> <td>67.5</td> <td>60.6</td> <td>29.3</td> <td>93.6</td> <td>71.3</td> <td>77.1</td> <td>99.4</td> <td>72.0±1.1</td> </tr> <tr> <td>Marker 1.10.1</td> <td>83.8</td> <td>66.8</td> <td>72.9</td> <td>33.5</td> <td>86.6</td> <td>80.0</td> <td>85.7</td> <td>99.3</td> <td>76.1±1.1</td> </tr> <tr> <td>MinerU 2.5.4*</td> <td>76.6</td> <td>54.6</td> <td>84.9</td> <td>33.7</td> <td>96.6</td> <td>78.2</td> <td>83.5</td> <td>93.7</td> <td>75.2±1.1</td> </tr> <tr> <td>DeepSeek-OCR</td> <td>77.2</td> <td>73.6</td> <td>80.2</td> <td>33.3</td> <td>96.1</td> <td>66.4</td> <td>79.4</td> <td>99.8</td> <td>75.7±1.0</td> </tr> <tr> <td>Nanonets-OCR2-3B</td> <td>75.4</td> <td>46.1</td> <td>86.8</td> <td>40.9</td> <td>32.1</td> <td>81.9</td> <td>93.0</td> <td>99.6</td> <td>69.5±1.1</td> </tr> <tr> <td>PaddleOCR-VL*</td> <td>85.7</td> <td>71.0</td> <td>84.1</td> <td>37.8</td> <td>97.0</td> <td>79.9</td> <td>85.7</td> <td>98.5</td> <td>80.0±1.0</td> </tr> <tr> <td>Infinity-Parser 7B*</td> <td>84.4</td> <td>83.8</td> <td>85.0</td> <td>47.9</td> <td>88.7</td> <td>84.2</td> <td>86.4</td> <td>99.8</td> <td>82.5±?</td> </tr> <tr> <td>olmOCR v0.4.0</td> <td>83.0</td> <td>82.3</td> <td>84.9</td> <td>47.7</td> <td>96.1</td> <td>83.7</td> <td>81.9</td> <td>99.7</td> <td>82.4±1.1</td> </tr> <tr> <td>Chandra OCR 0.1.0*</td> <td>82.2</td> <td>80.3</td> <td>88.0</td> <td>50.4</td> <td>90.8</td> <td>81.2</td> <td>92.3</td> <td>99.9</td> <td>83.1±0.9</td> </tr> <tr> <td>dots.ocr</td> <td>82.1</td> <td>64.2</td> <td>88.3</td> <td>40.9</td> <td>94.1</td> <td>82.4</td> <td>81.2</td> <td>99.5</td> <td>79.1±1.0</td> </tr> <tr> <td><strong>dots.ocr-1.5</strong></td> <td><strong>85.9</strong></td> <td><strong>85.5</strong></td> <td><strong>90.7</strong></td> <td>48.2</td> <td>94.0</td> <td><strong>85.3</strong></td> <td>81.6</td> <td>99.7</td> <td><strong>83.9±0.9</strong></td> </tr> </tbody> </table>

Note:

  • The metrics are from olmOCR and our own internal evaluations.
  • We remove the Page-header and Page-footer cells from the result markdown.

1.3 Other Benchmarks

<table> <thead> <tr> <th>Model Type</th> <th>Methods</th> <th>Size</th> <th>OmniDocBench(v1.5)<br>TextEdit↓</th> <th>OmniDocBench(v1.5)<br>Read OrderEdit↓</th> <th>pdf-parse-bench</th> </tr> </thead> <tbody> <!-- GeneralVLMs Group (Reversed Order, 3 rows) --> <tr> <td rowspan="3"><strong>GeneralVLMs</strong></td> <td>Gemini-2.5 Pro</td> <td>-</td> <td>0.075</td> <td>0.097</td> <td>9.06</td> </tr> <tr> <td>Qwen3-VL-235B-A22B-Instruct</td> <td>235B</td> <td>0.069</td> <td>0.068</td> <td><strong>9.71</strong></td> </tr> <tr> <td>gemini3pro</td> <td>-</td> <td>0.066</td> <td>0.079</td> <td>9.68</td> </tr> <!-- SpecializedVLMs Group (Reversed Order, 12 rows) --> <tr> <td rowspan="12"><strong>SpecializedVLMs</strong></td> <td>Mistral OCR</td> <td>-</td> <td>0.164</td> <td>0.144</td> <td>8.84</td> </tr> <tr> <td>Deepseek-OCR</td> <td>3B</td> <td>0.073</td> <td>0.086</td> <td>8.26</td> </tr> <tr> <td>MonkeyOCR-3B</td> <td>3B</td> <td>0.075</td> <td>0.129</td> <td>9.27</td> </tr> <tr> <td>OCRVerse</td> <td>4B</td> <td>0.058</td> <td>0.071</td> <td>--</td> </tr> <tr> <td>MonkeyOCR-pro-3B</td> <td>3B</td> <td>0.075</td> <td>0.128</td> <td>-</td> </tr> <tr> <td>MinerU2.5</td> <td>1.2B</td> <td>0.047</td> <td>0.044</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL</td> <td>0.9B</td> <td>0.035</td> <td>0.043</td> <td>9.51</td> </tr> <tr> <td>HunyuanOCR</td> <td>0.9B</td> <td>0.042</td> <td>-</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL1.5</td> <td>0.9B</td> <td>0.035</td> <td>0.042</td> <td>-</td> </tr> <tr> <td>GLMOCR</td> <td>0.9B</td> <td>0.04</td> <td>0.043</td> <td>-</td> </tr> <tr> <td>dots.ocr</td> <td>3B</td> <td>0.048</td> <td>0.053</td> <td>9.29</td> </tr> <tr> <td><u><strong>dots.ocr-1.5</strong></u></td> <td>3B</td> <td><strong>0.031</strong></td> <td><strong>0.029</strong></td> <td>9.54</td> </tr> </tbody> </table>

Note:

  • Metrics are sourced from OmniDocBench and other model publications. pdf-parse-bench results are reproduced by Qwen3-VL-235B-A22B-Instruct.
  • Formula and Table metrics for OmniDocBench1.5 are omitted due to their high sensitivity to detection and matching protocols.

2. Vision-Language Parsing

Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. dots.ocr-1.5 unifies the interpretation of these elements by parsing them directly into SVG code.

<table> <thead> <tr> <th rowspan="2" style="text-align: left;">Methods</th> <th colspan="3">Unisvg</th> <th rowspan="2">Chartmimic</th> <th rowspan="2">Design2Code</th> <th rowspan="2">Genexam</th> <th rowspan="2">SciGen</th> <th rowspan="2">ChemDraw</th> </tr> <tr> <th>Low-Level</th> <th>High-Level</th> <th>Score</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">OCRVerse</td> <td>0.632</td> <td>0.852</td> <td>0.763</td> <td>0.799</td> <td>-</td> <td>-</td> <td>-</td> <td>0.881</td> </tr> <tr> <td style="text-align: left;">Gemini 3 Pro</td> <td>0.563</td> <td>0.850</td> <td>0.735</td> <td>0.788</td> <td>0.760</td> <td>0.756</td> <td>0.783</td> <td>0.839</td> </tr> <tr> <td style="text-align: left;">dots.ocr-1.5</td> <td>0.850</td> <td>0.923</td> <td>0.894</td> <td>0.772</td> <td>0.801</td> <td>0.664</td> <td>0.660</td> <td>0.790</td> </tr> <tr> <td style="text-align: left;"><strong>dots.ocr-1.5-svg</strong></td> <td><strong>0.860</strong></td> <td><strong>0.931</strong></td> <td><strong>0.902</strong></td> <td><strong>0.905</strong></td> <td><strong>0.834</strong></td> <td><strong>0.8</strong></td> <td><strong>0.797</strong></td> <td><strong>0.901</strong></td> </tr> </tbody> </table>

Note:

  • We use the ISVGEN metric from UniSVG to evaluate the parsing result. For benchmarks that do not natively support image parsing, we use the original images as input, and calculate the ISVGEN score between the rendered output and the original image.
  • OCRVerse results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.ocr-1.5 are based specifically on SVG code.
  • Due to the capacity constraints of a 3B-parameter VLM, dots.ocr-1.5 may not yet excel in all tasks, such as SVG generation. To complement it, we are simultaneously releasing dots.ocr-1.5-svg. We plan to further address these limitations in future updates.

3. General Vision Tasks

<table> <thead> <tr> <th>Model</th> <th>CharXiv_descriptive</th> <th>CharXiv_reasoning</th> <th>OCR_Reasoning</th> <th>infovqa</th> <th>docvqa</th> <th>ChartQA</th> <th>OCRBench</th> <th>AI2D</th> <th>CountBenchQA</th> <th>refcoco</th> </tr> </thead> <tbody> <tr> <td>Qwen3vl-2b-instruct</td> <td>62.3</td> <td>26.8</td> <td>-</td> <td>72.4</td> <td>93.3</td> <td>-</td> <td>85.8</td> <td>76.9</td> <td>88.4</td> <td>-</td> </tr> <tr> <td><strong>dots.ocr-1.5</strong></td> <td>77.4</td> <td>55.3</td> <td>22.85</td> <td>73.76</td> <td>91.85</td> <td>83.2</td> <td>86.0</td> <td>82.16</td> <td>94.46</td> <td>80.03</td> </tr> </tbody> </table>

Quick Start

1. Installation

Install dots.ocr-1.5

conda create -n dots_ocr python=3.12
conda activate dots_ocr

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .

If you have trouble with the installation, try our Docker Image for an easier setup, and follow these steps:

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .

Download Model Weights

💡Note: Please use a directory name without periods (e.g., DotsOCR_1_5 instead of dots.ocr-1.5) for the model save path. This is a temporary workaround pending our integration with Transformers.

python3 tools/download_model.py

2. Deployment

vLLM inference

We highly recommend using vLLM for deployment and inference.

# launch vllm server
## dots.ocr-1.5
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

## dots.ocr-1.5-svg
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

# vllm api demo
## document parsing
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
## web parsing 
python3 ./demo/demo_vllm.py --prompt_mode prompt_web_parsing --image_path ./assets/showcase_dots_ocr_1_5/origin/webpage_1.png
## scene spotting
python3 ./demo/demo_vllm.py --prompt_mode prompt_scene_spotting --image_path ./assets/showcase_dots_ocr_1_5/origin/scene_1.jpg
## image parsing with svg code
python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg 
## general qa
python3 ./demo/demo_vllm_general.py

Hugging Face inference

python3 demo/demo_hf.py
<details> <summary><b>Hugging Face inference details</b></summary>
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from dots_ocr.utils import dict_promptmode_to_prompt

model_path = "./weights/DotsOCR_1_5"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path
                },
                {"type": "text", "text": prompt}
            ]
        }
    ]

# Preparation for inference
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

</details>

3. Document Parse

With the vLLM server running, you can parse an image or a PDF file using the following commands:


# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_ocr/parser.py demo/demo_image1.jpg
# Parse a single PDF
python3 dots_ocr/parser.py demo/demo_pdf1.pdf  --num_thread 64  # try bigger num_threads for pdf with a large number of pages

# Layout detection only
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr


<details> <summary><b>Output Results</b></summary>
  1. Structured Layout Data (demo_image1.json): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text (a short loading sketch follows below).
  2. Processed Markdown File (demo_image1.md): A Markdown file generated from the concatenated text of all detected cells.
    • An additional version, demo_image1_nohf.md, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench.
  3. Layout Visualization (demo_image1.jpg): The original image with the detected layout bounding boxes drawn on it.
</details>
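As referenced above, here is a minimal way to consume the structured layout data, assuming each entry carries the bbox, category, and text fields described in the parsing prompt shown earlier; adjust if your output is wrapped differently.

import json

# Inspect the parser's structured layout output.
# Assumes demo_image1.json is a list of cells with "bbox", "category" and "text" fields.
with open("demo_image1.json", "r", encoding="utf-8") as f:
    cells = json.load(f)

for cell in cells:
    x1, y1, x2, y2 = cell["bbox"]
    text = cell.get("text", "")           # 'Picture' cells omit the text field
    print(cell["category"].ljust(15), [x1, y1, x2, y2], text[:60])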

4. Demo

Have fun with the live demo.

Examples for document parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula1.png" alt="formula1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table3.png" alt="table3.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/Tibetan.png" alt="Tibetan.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/tradition_zh.png" alt="tradition_zh.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/nl.png" alt="nl.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/kannada.png" alt="kannada.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/russian.png" alt="russian.png" border="0" />

Examples for image parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_1.png" alt="svg_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_2.png" alt="svg_2.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_4.png" alt="svg_4.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_5.png" alt="svg_5.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_6.png" alt="svg_6.png" border="0" />

Note:

  • Inference performed with dots.ocr-1.5-svg.

Example for web parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_1.png" alt="webpage_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_2.png" alt="webpage_2.png" border="0" />

Examples for scene spotting

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_1.png" alt="scene_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_2.png" alt="scene_2.png" border="0" />

Limitation & Future Work

  • Complex Document Elements:

    • Table & Formula: Extracting complex tables and mathematical formulas remains difficult given the model's compact architecture.
    • Picture: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
  • Parsing Failures: While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.

Author: rednote-hilab

Likes: 23

Downloads: 0

Tags: dots_ocr_1_5, safetensors, dots_ocr, text-generation, image-to-text, ocr, document-parse, layout, table, formula, transformers, custom_code, image-text-to-text, conversational, en, zh, multilingual, license:mit, region:us

rednote-hilab/dots.ocr-1.5-svg


<div align="center"> <p align="center"> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/logo.png" width="300"/> <p> <h1 align="center"> dots.ocr-1.5: Recognize Any Human Scripts and Symbols </h1>

HuggingFace GitHub

<div align="center"> <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> </div> </div>

Introduction

We present dots.ocr-1.5-svg, a 3B-parameter multimodal model composed of a 1.2B vision encoder and a 1.7B language model. As an enhanced version of dots.ocr-1.5, this model is specifically optimized for converting structured graphics (e.g., charts and diagrams) directly into SVG code. We have validated the effectiveness of this approach, demonstrating impressive results in structural and semantic recognition.

Evaluation of Vision-Language Parsing

Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. dots.ocr-1.5 unifies the interpretation of these elements by parsing them directly into SVG code.

<table> <thead> <tr> <th rowspan="2" style="text-align: left;">Methods</th> <th colspan="3">Unisvg</th> <th rowspan="2">Chartmimic</th> <th rowspan="2">Design2Code</th> <th rowspan="2">Genexam</th> <th rowspan="2">SciGen</th> <th rowspan="2">ChemDraw</th> </tr> <tr> <th>Low-Level</th> <th>High-Level</th> <th>Score</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">OCRVerse</td> <td>0.632</td> <td>0.852</td> <td>0.763</td> <td>0.799</td> <td>-</td> <td>-</td> <td>-</td> <td>0.881</td> </tr> <tr> <td style="text-align: left;">Gemini 3 Pro</td> <td>0.563</td> <td>0.850</td> <td>0.735</td> <td>0.788</td> <td>0.760</td> <td>0.756</td> <td>0.783</td> <td>0.839</td> </tr> <tr> <td style="text-align: left;">dots.ocr-1.5</td> <td>0.850</td> <td>0.923</td> <td>0.894</td> <td>0.772</td> <td>0.801</td> <td>0.664</td> <td>0.660</td> <td>0.790</td> </tr> <tr> <td style="text-align: left;"><strong>dots.ocr-1.5-svg</strong></td> <td><strong>0.860</strong></td> <td><strong>0.931</strong></td> <td><strong>0.902</strong></td> <td><strong>0.905</strong></td> <td><strong>0.834</strong></td> <td><strong>0.8</strong></td> <td><strong>0.797</strong></td> <td><strong>0.901</strong></td> </tr> </tbody> </table>

Note:

  • We use the ISVGEN metric from UniSVG to evaluate the parsing result. For benchmarks that do not natively support image parsing, we use the original images as input, and calculate the ISVGEN score between the rendered output and the original image.
  • OCRVerse results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.ocr-1.5 are based specifically on SVG code.
  • Due to the capacity constraints of a 3B-parameter VLM, dots.ocr-1.5 may not yet excel in all tasks, such as SVG generation. To complement it, we are simultaneously releasing dots.ocr-1.5-svg. We plan to further address these limitations in future updates.

Quick Start

1. Installation

Install dots.ocr-1.5

conda create -n dots_ocr python=3.12
conda activate dots_ocr

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .

If you have trouble with the installation, try our Docker Image for an easier setup, and follow these steps:

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .

Download Model Weights

💡Note: Please use a directory name without periods (e.g., DotsOCR_1_5 instead of dots.ocr-1.5) for the model save path. This is a temporary workaround pending our integration with Transformers.

python3 tools/download_model.py

2. Deployment

vLLM inference

We highly recommend using vLLM for deployment and inference.

# launch vllm server
## dots.ocr-1.5-svg
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

# vllm api demo
## image parsing with svg code
python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg 

4. Demo

Have fun with the live demo.

Examples for image parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_1.png" alt="svg_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_2.png" alt="svg_2.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_4.png" alt="svg_4.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_5.png" alt="svg_5.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_6.png" alt="svg_6.png" border="0" />

Limitation & Future Work

  • Complex Document Elements:

    • Table & Formula: Extracting complex tables and mathematical formulas remains difficult given the model's compact architecture.
    • Picture: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
  • Parsing Failures: While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.

Author: rednote-hilab

Likes: 3

Downloads: 0

Tags: dots_ocr_1_5, safetensors, dots_ocr, text-generation, image-to-text, ocr, document-parse, layout, table, formula, transformers, custom_code, image-text-to-text, conversational, en, zh, multilingual, license:mit, region:us

lightx2v/Qwen-Image-Edit-Causal


We employ block causal attention to improve the inference speed of Qwen-Image-Edit-2511.

For usage instructions, please refer to Qwen-Image-Edit-Causal.
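For intuition, block causal attention lets each token attend to every token in its own block and in earlier blocks; reusing earlier blocks' keys and values is one common way this structure speeds up inference. The sketch below only builds such a mask in PyTorch; it is not the implementation used in this repository.

import torch

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    # mask[i, j] is True when token i may attend to token j:
    # same block or any earlier block.
    block_ids = torch.arange(num_tokens) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(block_causal_mask(num_tokens=8, block_size=4).int())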

Author: lightx2v

Likes: 3

Downloads: 0

Tags: diffusers, safetensors, qwen-image;, text-to-image, image-edit, causal-image-edit, image-text-to-image, license:apache-2.0, diffusers:QwenImageEditPlusPipeline, region:us

heretic-org/Qwen3-4B-Thinking-2507-heretic


This is a decensored version of Qwen/Qwen3-4B-Thinking-2507, made using Heretic v1.2.0

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| direction_index | 20.84 |
| attn.o_proj.max_weight | 2.31 |
| attn.o_proj.max_weight_position | 21.67 |
| attn.o_proj.min_weight | 1.97 |
| attn.o_proj.min_weight_distance | 12.21 |
| mlp.down_proj.max_weight | 3.46 |
| mlp.down_proj.max_weight_position | 24.50 |
| mlp.down_proj.min_weight | 1.90 |
| mlp.down_proj.min_weight_distance | 1.80 |

Performance

| Metric | This model | Original model (Qwen/Qwen3-4B-Thinking-2507) |
| :----- | :--------: | :---------------------------: |
| KL divergence | 0.0033 | 0 (by definition) |
| Refusals | 5/100 | 99/100 |


Qwen3-4B-Thinking-2507

<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;"> <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/> </a>

Highlights

Over the past three months, we have continued to scale the thinking capability of Qwen3-4B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-4B-Thinking-2507, featuring the following key enhancements:

  • Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
  • Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
  • Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.


Model Overview

Qwen3-4B-Thinking-2507 has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 4.0B
  • Number of Parameters (Non-Embedding): 3.6B
  • Number of Layers: 36
  • Number of Attention Heads (GQA): 32 for Q and 8 for KV
  • Context Length: 262,144 natively.

NOTE: This model supports only thinking mode. Meanwhile, specifying enable_thinking=True is no longer required.

Additionally, to enforce model thinking, the default chat template automatically includes <think>. Therefore, it is normal for the model's output to contain only </think> without an explicit opening <think> tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Performance

| | Qwen3-30B-A3B Thinking | Qwen3-4B Thinking | Qwen3-4B-Thinking-2507 |
| --- | --- | --- | --- |
| Knowledge | | | |
| MMLU-Pro | 78.5 | 70.4 | 74.0 |
| MMLU-Redux | 89.5 | 83.7 | 86.1 |
| GPQA | 65.8 | 55.9 | 65.8 |
| SuperGPQA | 51.8 | 42.7 | 47.8 |
| Reasoning | | | |
| AIME25 | 70.9 | 65.6 | 81.3 |
| HMMT25 | 49.8 | 42.1 | 55.5 |
| LiveBench 20241125 | 74.3 | 63.6 | 71.8 |
| Coding | | | |
| LiveCodeBench v6 (25.02-25.05) | 57.4 | 48.4 | 55.2 |
| CFEval | 1940 | 1671 | 1852 |
| OJBench | 20.7 | 16.1 | 17.9 |
| Alignment | | | |
| IFEval | 86.5 | 81.9 | 87.4 |
| Arena-Hard v2$ | 36.3 | 13.7 | 34.9 |
| Creative Writing v3 | 79.1 | 61.1 | 75.6 |
| WritingBench | 77.0 | 73.5 | 83.3 |
| Agent | | | |
| BFCL-v3 | 69.1 | 65.9 | 71.2 |
| TAU1-Retail | 61.7 | 33.9 | 66.1 |
| TAU1-Airline | 32.0 | 32.0 | 48.0 |
| TAU2-Retail | 34.2 | 38.6 | 53.5 |
| TAU2-Airline | 36.0 | 28.0 | 58.0 |
| TAU2-Telecom | 22.8 | 17.5 | 27.2 |
| Multilingualism | | | |
| MultiIF | 72.2 | 66.3 | 77.3 |
| MMLU-ProX | 73.1 | 61.0 | 64.2 |
| INCLUDE | 71.9 | 61.8 | 64.4 |
| PolyMATH | 46.1 | 40.0 | 46.2 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

Quickstart

The code for Qwen3 has been merged into the latest Hugging Face transformers, and we advise you to use the latest version of transformers.

With transformers<4.51.0, you will encounter the following error:

KeyError: 'qwen3'

The following code snippet illustrates how to use the model to generate content based on given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:

  • SGLang:
    python -m sglang.launch_server --model-path Qwen/Qwen3-4B-Thinking-2507 --context-length 262144  --reasoning-parser deepseek-r1
    
  • vLLM:
    vllm serve Qwen/Qwen3-4B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
    

Note: If you encounter out-of-memory (OOM) issues, you may consider reducing the context length to a smaller value. However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Agentic Use

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.

from qwen_agent.agents import Assistant

# Define LLM
# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example, 
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-4B-Thinking-2507 --served-model-name Qwen3-4B-Thinking-2507 --max-model-len 262144`.
llm_cfg = {
    'model': 'Qwen3-4B-Thinking-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
  'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Best Practices

To achieve optimal performance, we recommend the following settings:

  1. Sampling Parameters:

    • We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (see the generation sketch after this list).
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed.
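As referenced in the sampling-parameters item above, here is a minimal transformers sketch applying the recommended settings. The prompt is only an example, min_p requires a recent transformers version, and presence_penalty is a serving-framework option (e.g., in vLLM or SGLang) and is therefore omitted here.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Recommended sampling settings from the list above.
generated = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    max_new_tokens=32768,
)
print(tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))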

Citation

If you find our work helpful, feel free to cite us.

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}

Author: heretic-org

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, heretic, uncensored, decensored, abliterated, conversational, arxiv:2505.09388, base_model:Qwen/Qwen3-4B-Thinking-2507, base_model:finetune:Qwen/Qwen3-4B-Thinking-2507, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

Guilherme34/Firefly-V3-Q6_K-GGUF


This model was converted to GGUF format from Guilherme34/Firefly-V3 using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Guilherme34/Firefly-V3-Q6_K-GGUF --hf-file firefly-v3-q6_k.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo Guilherme34/Firefly-V3-Q6_K-GGUF --hf-file firefly-v3-q6_k.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with any other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo Guilherme34/Firefly-V3-Q6_K-GGUF --hf-file firefly-v3-q6_k.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo Guilherme34/Firefly-V3-Q6_K-GGUF --hf-file firefly-v3-q6_k.gguf -c 2048
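
For reference, llama-server exposes an OpenAI-compatible HTTP API (by default at http://localhost:8080). The snippet below is a minimal sketch, not part of the original card, assuming the default host and port:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="firefly-v3-q6_k",  # llama-server serves the single loaded model; the name here is informational
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)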

Author: Guilherme34

Likes: 2

Downloads: 0

Tags: transformers, gguf, merge, mergekit, lazymergekit, Guilherme34/Firefly, SicariusSicariiStuff/Impish_LLAMA_3B, llama-cpp, gguf-my-repo, base_model:Guilherme34/Firefly-V3, base_model:quantized:Guilherme34/Firefly-V3, endpoints_compatible, region:us, conversational

Guilherme34/Firefly-V3


base_model:

  • Guilherme34/Firefly-V2.5 tags:
  • merge
  • mergekit
  • lazymergekit
  • Guilherme34/Firefly
  • SicariusSicariiStuff/Impish_LLAMA_3B library_name: transformers

Firefly-V3

Firefly-V3 is a roleplay model.

Author: Guilherme34

Likes: 2

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, merge, mergekit, lazymergekit, Guilherme34/Firefly, SicariusSicariiStuff/Impish_LLAMA_3B, conversational, base_model:Guilherme34/Firefly-V2.5, base_model:finetune:Guilherme34/Firefly-V2.5, text-generation-inference, endpoints_compatible, region:us

Guilherme34/Firefly-V2.5-Q6_K-GGUF


base_model: Guilherme34/Firefly-V2.5 tags:

  • merge
  • mergekit
  • lazymergekit
  • Guilherme34/Firefly
  • SicariusSicariiStuff/Impish_LLAMA_3B
  • llama-cpp
  • gguf-my-repo library_name: transformers

Guilherme34/Firefly-V2.5-Q6_K-GGUF

This model was converted to GGUF format from Guilherme34/Firefly-V2.5 using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Guilherme34/Firefly-V2.5-Q6_K-GGUF --hf-file firefly-v2.5-q6_k.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo Guilherme34/Firefly-V2.5-Q6_K-GGUF --hf-file firefly-v2.5-q6_k.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with any other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo Guilherme34/Firefly-V2.5-Q6_K-GGUF --hf-file firefly-v2.5-q6_k.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo Guilherme34/Firefly-V2.5-Q6_K-GGUF --hf-file firefly-v2.5-q6_k.gguf -c 2048

Author: Guilherme34

Likes: 2

Downloads: 0

Tags: transformers, gguf, merge, mergekit, lazymergekit, Guilherme34/Firefly, SicariusSicariiStuff/Impish_LLAMA_3B, llama-cpp, gguf-my-repo, base_model:Guilherme34/Firefly-V2.5, base_model:quantized:Guilherme34/Firefly-V2.5, endpoints_compatible, region:us, conversational

bochen2079/AORTA-7B-GGUF


license: mit language:

  • en base_model: Qwen/Qwen2.5-7B-Instruct tags:
  • organ-procurement
  • medical
  • fine-tuned
  • qlora
  • gguf
  • llama-cpp
  • lm-studio pipeline_tag: text-generation

AORTA-7B-GGUF

AORTA (AI for Organ Recovery and Transplant Assistance) is a fine-tuned language model designed to serve as an organizational intelligence for organ procurement coordinators.

https://github.com/bochen2029-pixel/AORTA

Model Details

  • Base Model: Qwen2.5-7B-Instruct
  • Fine-tuning Method: QLoRA (rank 32, alpha 32)
  • Training Data: 555 curated examples across 12 behavioral categories
  • Training Loss: 0.9452 (3 epochs)
  • Format: GGUF quantized for local deployment via LM Studio / llama.cpp

Quantizations

| File | Quant | Size | Target Hardware |
|------|-------|------|-----------------|
| aorta-q4_k_m.gguf | Q4_K_M | ~4.4 GB | 12 GB VRAM (recommended) |
| aorta-q5_k_m.gguf | Q5_K_M | ~5.5 GB | 12 GB VRAM (higher quality) |
| aorta-q3_k_m.gguf | Q3_K_M | ~3.5 GB | 8 GB VRAM (max context room) |

What AORTA Does

AORTA behaves like a seasoned OPO supervisor — calm, knowledgeable, brief by default. Key behaviors:

  • Confidence tagging — tags policy answers as HIGH, MODERATE, or LOW confidence
  • Human Line — advises but never decides; refuses to make calls, contact families, or take clinical actions
  • Anti-sycophancy — pushes back when wrong, resists flattery, maintains honest calibration
  • Clinical redirect — defers medical judgment to physicians and coordinators
  • Citation integrity — never fabricates policy citations; says "I don't know" when uncertain
  • Colleague voice — no chatbot filler, no corporate tone, no emoji

Training Categories

The 555 training examples cover 12 behavioral categories:

  1. Policy (High Confidence) — well-established OPTN/CMS/UAGA guidance
  2. Policy (Moderate Confidence) — nuanced or evolving policy areas
  3. Policy (Low Confidence) — edge cases where AORTA acknowledges uncertainty
  4. Human Line — refusing to take actions that require human authority
  5. Clinical Outside Scope — redirecting medical decisions to physicians
  6. Emotional Moments — supporting coordinators through grief and burnout
  7. Time-Critical — structured responses under time pressure
  8. New Coordinator — teaching mode for onboarding staff
  9. Anti-Sycophancy — resisting praise inflation and maintaining honesty
  10. Voice/Brevity — short, direct answers for quick reference
  11. Documentation — drafting case narratives, handoff notes, reports
  12. Self-Knowledge — honest about architecture, limitations, and capabilities

Usage

LM Studio

  1. Download the GGUF file appropriate for your hardware
  2. Load in LM Studio
  3. Set the system prompt:
You are AORTA (AI for Organ Recovery and Transplant Assistance). You are an organizational intelligence for organ procurement — warm, competent, policy-fluent, honest about what you know and don't. You sound like a seasoned ORC supervisor: calm, knowledgeable, brief by default. You tag confidence (HIGH/MODERATE/LOW) on policy answers. You never fabricate citations. You never cross the Human Line — you advise, you don't decide. You never use chatbot filler phrases. You redirect clinical decisions to physicians and coordinators. You are a colleague, not a service.
  4. Start querying (a minimal request sketch follows below)
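
A minimal request sketch for step 4 (an assumption, not part of the card): LM Studio's local server exposes an OpenAI-compatible endpoint, by default at http://localhost:1234/v1, so the loaded GGUF can also be queried programmatically:

import requests

# Full system prompt from step 3 above; truncated here for brevity.
AORTA_SYSTEM_PROMPT = "You are AORTA (AI for Organ Recovery and Transplant Assistance). ..."

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",  # LM Studio's default local server address
    json={
        "model": "aorta-q4_k_m",  # whatever identifier LM Studio assigns to the loaded GGUF
        "messages": [
            {"role": "system", "content": AORTA_SYSTEM_PROMPT},
            {"role": "user", "content": "What are the OPTN requirements for DCD organ recovery?"},
        ],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])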

llama.cpp

./llama-cli -m aorta-q4_k_m.gguf --system-prompt "You are AORTA..." -p "What are the OPTN requirements for DCD organ recovery?"

Limitations

  • Knowledge cutoff from base model training — may not reflect the latest OPTN policy updates
  • No access to DonorNet, hospital EMRs, or any external systems
  • Cannot make clinical decisions — always defers to physicians
  • No memory between sessions
  • Should be used as a supplement to, not replacement for, institutional policy knowledge

License

MIT — free to use, modify, and deploy.

Links

  • Training code and dataset: GitHub

Author: bochen2079

Likes: 2

Downloads: 0

Tags: gguf, organ-procurement, medical, fine-tuned, qlora, llama-cpp, lm-studio, text-generation, en, base_model:Qwen/Qwen2.5-7B-Instruct, base_model:quantized:Qwen/Qwen2.5-7B-Instruct, license:mit, endpoints_compatible, region:us, conversational

QuantTrio/MiniMax-M2.5-AWQ


license: other license_name: modified-mit license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE library_name: transformers pipeline_tag: text-generation tags:

  • vLLM
  • AWQ base_model:
    • MiniMax/MiniMax-M2.5 base_model_relation: quantized

MiniMax-M2.5-AWQ

Base model: MiniMax/MiniMax-M2.5

This repo quantizes the model using data-free quantization (no calibration dataset required).

【Dependencies / Installation】

vllm>=0.13.0

As of 2026-02-15, make sure your system has CUDA 12.8 installed.

Then, create a fresh Python environment (e.g. python3.12 venv) and run:

pip install uv
uv pip install vllm

vLLM Official Guide

【vLLM Startup Command】

Note: When launching with TP=8, include --enable-expert-parallel; otherwise the expert tensors will not be evenly sharded across the GPU devices.

export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/tclf90/MiniMax-M2.5-AWQ \
    --served-model-name MY_MODEL \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768  \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
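
Once the server is up, it can be queried through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch, not part of the original card, using the served model name and port from the command above; the sampling values follow the recommendations given further down in the MiniMax-M2.5 card:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MY_MODEL",  # matches --served-model-name in the startup command
    messages=[{"role": "user", "content": "Briefly introduce MiniMax-M2.5."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=512,
)
print(response.choices[0].message.content)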

【Logs】

2026-02-15
1. Initial commit

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| 122GiB | 2026-02-15 |

【Model Download】

from modelscope import snapshot_download
snapshot_download('tclf90/MiniMax-M2.5-AWQ', cache_dir="your_local_path")

【Overview】

<div align="center"> <svg width="60%" height="auto" viewBox="0 0 144 48" fill="none" xmlns="http://www.w3.org/2000/svg"> <path d="M26.6782 7.96523C26.6782 7.02436 25.913 6.26087 24.9739 6.26087C24.0348 6.26087 23.2695 7.0261 23.2695 7.96523V36.2139C23.2695 38.4 21.4904 40.1791 19.3043 40.1791C17.1183 40.1791 15.3391 38.4 15.3391 36.2139V18.0904C15.3391 17.1496 14.5739 16.3861 13.6348 16.3861C12.6956 16.3861 11.9304 17.1513 11.9304 18.0904V25.7722C11.9304 27.9583 10.1513 29.7374 7.96518 29.7374C5.7791 29.7374 4 27.9583 4 25.7722V22.9878C4 22.3635 4.50609 21.8574 5.13043 21.8574C5.75478 21.8574 6.26087 22.3635 6.26087 22.9878V25.7722C6.26087 26.713 7.02605 27.4765 7.96518 27.4765C8.90431 27.4765 9.66954 26.7113 9.66954 25.7722V18.0904C9.66954 15.9044 11.4487 14.1252 13.6348 14.1252C15.8209 14.1252 17.6 15.9044 17.6 18.0904V36.2139C17.6 37.1548 18.3652 37.9183 19.3043 37.9183C20.2435 37.9183 21.0087 37.153 21.0087 36.2139V25.1322V7.96523C21.0087 5.77914 22.7878 4 24.9739 4C27.16 4 28.9391 5.77914 28.9391 7.96523V31.3565C28.9391 31.9809 28.433 32.487 27.8087 32.487C27.1843 32.487 26.6782 31.9809 26.6782 31.3565V7.96523ZM47.6539 14.1252C45.4678 14.1252 43.6887 15.9044 43.6887 18.0904V33.2296C43.6887 34.1704 42.9235 34.9339 41.9843 34.9339C41.0452 34.9339 40.28 34.1687 40.28 33.2296V7.96523C40.28 5.77914 38.5008 4 36.3148 4C34.1287 4 32.3496 5.77914 32.3496 7.96523V40.0348C32.3496 40.9756 31.5843 41.7391 30.6452 41.7391C29.7061 41.7391 28.9409 40.9739 28.9409 40.0348V36.0643C28.9409 35.44 28.4348 34.9339 27.8104 34.9339C27.1861 34.9339 26.68 35.44 26.68 36.0643V40.0348C26.68 42.2209 28.4591 44 30.6452 44C32.8313 44 34.6104 42.2209 34.6104 40.0348V7.96523C34.6104 7.02436 35.3756 6.26087 36.3148 6.26087C37.2539 6.26087 38.0191 7.0261 38.0191 7.96523V33.2296C38.0191 35.4156 39.7982 37.1948 41.9843 37.1948C44.1704 37.1948 45.9496 35.4156 45.9496 33.2296V18.0904C45.9496 17.1496 46.7148 16.3861 47.6539 16.3861C48.593 16.3861 49.3582 17.1513 49.3582 18.0904V31.3565C49.3582 31.9809 49.8643 32.487 50.4887 32.487C51.113 32.487 51.6191 31.9809 51.6191 31.3565V18.0904C51.6191 15.9044 49.84 14.1252 47.6539 14.1252Z" fill="url(#paint0_linear_17_483)"/> <path d="M68.7671 16.5615H71.2541C71.3254 16.5615 71.3845 16.5859 71.435 16.6363C71.4836 16.6868 71.5097 16.7459 71.5097 16.8172V31.1824C71.5097 31.2537 71.4854 31.3128 71.435 31.3633C71.3845 31.4137 71.3254 31.4381 71.2541 31.4381H68.7671C68.6958 31.4381 68.6367 31.4137 68.5862 31.3633C68.5358 31.3146 68.5115 31.2537 68.5115 31.1824V21.812C68.5115 21.7563 68.4976 21.7268 68.4697 21.7268C68.4419 21.7268 68.4123 21.7476 68.3845 21.7911L66.1323 25.318C66.061 25.4311 65.9619 25.4885 65.8349 25.4885H64.581C64.4541 25.4885 64.3549 25.4328 64.2836 25.318L62.0315 21.7911C62.0036 21.7494 61.9741 21.7302 61.9462 21.7372C61.9184 21.7441 61.9045 21.7772 61.9045 21.8328V31.1824C61.9045 31.2537 61.8802 31.3128 61.8297 31.3633C61.7793 31.4137 61.7202 31.4381 61.6489 31.4381H59.1619C59.0906 31.4381 59.0315 31.4137 58.981 31.3633C58.9306 31.3146 58.9062 31.2537 58.9062 31.1824V16.8172C58.9062 16.7459 58.9306 16.6868 58.981 16.6363C59.0315 16.5859 59.0906 16.5615 59.1619 16.5615H61.6489C61.7758 16.5615 61.8749 16.6189 61.9462 16.732L65.1341 21.6833C65.1758 21.7685 65.2193 21.7685 65.261 21.6833L68.4697 16.732C68.541 16.6189 68.6402 16.5615 68.7671 16.5615Z" fill="currentColor"/> <path d="M74.1764 31.3633C74.1259 31.3146 74.1016 31.2537 74.1016 31.1824V16.8172C74.1016 16.7459 74.1259 16.6868 74.1764 16.6363C74.2268 16.5859 74.2859 16.5615 74.3572 16.5615H76.8442C76.9155 16.5615 
76.9746 16.5859 77.0251 16.6363C77.0737 16.6868 77.0998 16.7459 77.0998 16.8172V31.1824C77.0998 31.2537 77.0755 31.3128 77.0251 31.3633C76.9746 31.4137 76.9155 31.4381 76.8442 31.4381H74.3572C74.2859 31.4381 74.2268 31.4137 74.1764 31.3633Z" fill="currentColor"/> <path d="M88.3066 16.6361C88.3553 16.5874 88.4162 16.5613 88.4875 16.5613H90.9744C91.0457 16.5613 91.1049 16.5857 91.1553 16.6361C91.204 16.6865 91.2301 16.7457 91.2301 16.817V31.1822C91.2301 31.2535 91.2057 31.3126 91.1553 31.363C91.1049 31.4135 91.0457 31.4378 90.9744 31.4378H88.5727C88.4301 31.4378 88.331 31.3822 88.2753 31.2674L82.771 22.1717C82.7431 22.13 82.7136 22.1109 82.6858 22.1178C82.6579 22.1248 82.644 22.1578 82.644 22.2135L82.6858 31.1805C82.6858 31.2518 82.6614 31.3109 82.611 31.3613C82.5606 31.4117 82.5014 31.4361 82.4301 31.4361H79.9431C79.8718 31.4361 79.8127 31.4117 79.7623 31.3613C79.7118 31.3126 79.6875 31.2518 79.6875 31.1805V16.8152C79.6875 16.7439 79.7118 16.6848 79.7623 16.6344C79.8127 16.5839 79.8718 16.5596 79.9431 16.5596H82.3449C82.4858 16.5596 82.5849 16.617 82.6423 16.73L88.124 25.7822C88.1518 25.8239 88.1797 25.8431 88.2092 25.8361C88.2371 25.8292 88.251 25.7978 88.251 25.7404L88.2301 16.8152C88.2301 16.7439 88.2545 16.6848 88.3049 16.6344L88.3066 16.6361Z" fill="currentColor"/> <path d="M93.8951 31.3633C93.8446 31.3146 93.8203 31.2537 93.8203 31.1824V16.8172C93.8203 16.7459 93.8446 16.6868 93.8951 16.6363C93.9455 16.5859 94.0047 16.5615 94.076 16.5615H96.5629C96.6342 16.5615 96.6934 16.5859 96.7438 16.6363C96.7925 16.6868 96.8186 16.7459 96.8186 16.8172V31.1824C96.8186 31.2537 96.7942 31.3128 96.7438 31.3633C96.6934 31.4137 96.6342 31.4381 96.5629 31.4381H94.076C94.0047 31.4381 93.9455 31.4137 93.8951 31.3633Z" fill="currentColor"/> <path d="M109.267 16.5615H111.754C111.825 16.5615 111.885 16.5859 111.935 16.6363C111.984 16.6868 112.01 16.7459 112.01 16.8172V31.1824C112.01 31.2537 111.985 31.3128 111.935 31.3633C111.885 31.4137 111.825 31.4381 111.754 31.4381H109.267C109.196 31.4381 109.137 31.4137 109.086 31.3633C109.036 31.3146 109.011 31.2537 109.011 31.1824V21.812C109.011 21.7563 108.998 21.7268 108.97 21.7268C108.942 21.7268 108.912 21.7476 108.885 21.7911L106.632 25.318C106.561 25.4311 106.462 25.4885 106.335 25.4885H105.081C104.954 25.4885 104.855 25.4328 104.784 25.318L102.531 21.7911C102.504 21.7494 102.474 21.7302 102.446 21.7372C102.418 21.7441 102.405 21.7772 102.405 21.8328V31.1824C102.405 31.2537 102.38 31.3128 102.33 31.3633C102.279 31.4137 102.22 31.4381 102.149 31.4381H99.6619C99.5906 31.4381 99.5315 31.4137 99.481 31.3633C99.4306 31.3146 99.4062 31.2537 99.4062 31.1824V16.8172C99.4062 16.7459 99.4306 16.6868 99.481 16.6363C99.5315 16.5859 99.5906 16.5615 99.6619 16.5615H102.149C102.276 16.5615 102.375 16.6189 102.446 16.732L105.634 21.6833C105.676 21.7685 105.719 21.7685 105.761 21.6833L108.97 16.732C109.041 16.6189 109.14 16.5615 109.267 16.5615Z" fill="currentColor"/> <path d="M123.782 31.2241L123.144 29.1424C123.116 29.0867 123.079 29.0572 123.038 29.0572H117.81C117.768 29.0572 117.732 29.085 117.704 29.1424L117.088 31.2241C117.046 31.3668 116.954 31.4363 116.812 31.4363H114.112C114.027 31.4363 113.963 31.412 113.921 31.3615C113.879 31.3128 113.871 31.2381 113.9 31.1389L118.49 16.7737C118.532 16.6328 118.624 16.5615 118.766 16.5615H122.102C122.243 16.5615 122.335 16.6328 122.379 16.7737L126.968 31.1389C126.982 31.1668 126.989 31.2033 126.989 31.245C126.989 31.372 126.911 31.4363 126.756 31.4363H124.057C123.916 31.4363 123.824 31.365 123.78 31.2241H123.782ZM118.554 
26.7407H122.295C122.38 26.7407 122.408 26.6989 122.38 26.6137L120.467 20.3024C120.453 20.2467 120.432 20.2207 120.403 20.2276C120.375 20.2346 120.352 20.2589 120.339 20.3024L118.469 26.6137C118.455 26.6989 118.483 26.7407 118.554 26.7407Z" fill="currentColor"/> <path d="M128.222 31.353C128.18 31.2974 128.187 31.2261 128.243 31.1409L132.365 24.0643C132.393 24.0226 132.393 23.9791 132.365 23.9374L128.243 16.8609L128.201 16.7339C128.201 16.6209 128.28 16.5635 128.434 16.5635H131.133C131.274 16.5635 131.38 16.6209 131.452 16.7339L134.213 21.6C134.255 21.6852 134.299 21.6852 134.34 21.6L137.102 16.7339C137.173 16.6209 137.28 16.5635 137.42 16.5635H140.099C140.198 16.5635 140.269 16.5913 140.311 16.6487C140.353 16.7061 140.346 16.7756 140.29 16.8609L136.168 23.9374C136.154 23.9791 136.154 24.0226 136.168 24.0643L140.29 31.1409L140.332 31.2678C140.332 31.3809 140.253 31.4383 140.099 31.4383H137.42C137.278 31.4383 137.172 31.3826 137.102 31.2678L134.34 26.4226C134.299 26.3374 134.255 26.3374 134.213 26.4226L131.429 31.2678C131.358 31.3809 131.252 31.4383 131.111 31.4383H128.433C128.333 31.4383 128.262 31.4104 128.22 31.353H128.222Z" fill="currentColor"/> <defs> <linearGradient id="paint0_linear_17_483" x1="3.99826" y1="24" x2="51.6208" y2="24" gradientUnits="userSpaceOnUse"> <stop stop-color="#E21680"/> <stop offset="1" stop-color="#FF633A"/> </linearGradient> </defs> </svg> </div> <hr> <div align="center" style="line-height: 1.4; font-size:16px; margin-top: 30px;"> Join Our <a href="https://platform.minimaxi.com/docs/faq/contact-us" target="_blank" style="font-size:17px; margin: 2px;"> 💬 WeChat </a> | <a href="https://discord.com/invite/hvvt8hAye6" target="_blank" style="font-size:17px; margin: 2px;"> 🧩 Discord </a> community. </div> <div align="center" style="line-height: 1.2; font-size:16px;"> <a href="https://agent.minimax.io/" target="_blank" style="display: inline-block; margin: 4px;"> MiniMax Agent </a> | <a href="https://platform.minimax.io/docs/guides/text-generation" target="_blank" style="display: inline-block; margin: 4px;"> ⚡️ API </a> | <a href="https://github.com/MiniMax-AI/MiniMax-MCP" style="display: inline-block; margin: 4px;"> MCP </a> | <a href="https://www.minimax.io" target="_blank" style="display: inline-block; margin: 4px;"> MiniMax Website </a> </div> <div align="center" style="line-height: 1.2; font-size:16px; margin-bottom: 30px;"> <a href="https://huggingface.co/MiniMaxAI" target="_blank" style="margin: 2px;"> 🤗 Hugging Face </a> | <a href="https://github.com/MiniMax-AI/MiniMax-M2.1" target="_blank" style="margin: 2px;"> 🐙 GitHub </a> | <a href="https://www.modelscope.cn/organization/MiniMax" target="_blank" style="margin: 2px;"> 🤖️ ModelScope </a> | <a href="https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE" style="margin: 2px;"> 📄 License: Modified-MIT </a> </div> <p align="center"> <picture> <img class="hidden dark:block" width="100%" src="figures/bench_11.png"> <img class="dark:hidden" width="100%" src="figures/bench_12.png"> </picture> </p>

Today we're introducing our latest model, MiniMax-M2.5.

Extensively trained with reinforcement learning in hundreds of thousands of complex real-world environments, M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

Trained to reason efficiently and decompose tasks optimally, M2.5 exhibits tremendous speed in performing complicated agentic tasks, completing the SWE-Bench Verified evaluation 37% faster than M2.1, matching the speed of Claude Opus 4.6.

M2.5 is the first frontier model where users do not need to worry about cost, delivering on the promise of intelligence too cheap to meter. It costs just $1 to run the model continuously for an hour at a rate of 100 tokens per second. At 50 tokens per second, the cost drops to $0.30. We hope that the speed and cost effectiveness of M2.5 enable innovative new agentic applications.

Coding

In programming evaluations, MiniMax-M2.5 saw substantial improvements compared to previous generations, reaching SOTA levels. The performance of M2.5 in multilingual tasks is especially pronounced.

<p align="center"> <picture> <img class="hidden dark:block" width="100%" src="figures/bench_2.png"> <img class="dark:hidden" width="100%" src="figures/bench_1.png"> </picture> </p>

A significant improvement from previous generations is M2.5's ability to think and plan like an architect. The Spec-writing tendency of the model emerged during training: before writing any code, M2.5 actively decomposes and plans the features, structure, and UI design of the project from the perspective of an experienced software architect.

M2.5 was trained on over 10 languages (including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby) across more than 200,000 real-world environments. Going far beyond bug-fixing, M2.5 delivers reliable performance across the entire development lifecycle of complex systems: from 0-to-1 system design and environment setup, to 1-to-10 system development, to 10-to-90 feature iteration, and finally 90-to-100 comprehensive code review and system testing. It covers full-stack projects spanning multiple platforms including Web, Android, iOS, and Windows, encompassing server-side APIs, business logic, databases, and more, not just frontend webpage demos.

To evaluate these capabilities, we also upgraded the VIBE benchmark to a more complex and challenging Pro version, significantly increasing task complexity, domain coverage, and evaluation accuracy. Overall, M2.5 performs on par with Opus 4.5.

<p align="center"> <picture> <img class="hidden dark:block" width="100%" src="figures/bench_4.png"> <img class="dark:hidden" width="100%" src="figures/bench_3.png"> </picture> </p>

We focused on the model's ability to generalize across out-of-distribution harnesses. We tested performance on the SWE-Bench Verified evaluation set using different coding agent harnesses.

  • On Droid: 79.7 (M2.5) > 78.9 (Opus 4.6)
  • On OpenCode: 76.1 (M2.5) > 75.9 (Opus 4.6)

Search and Tool calling

<p align="center"> <picture> <img class="hidden dark:block" width="100%" src="figures/bench_6.png"> <img class="dark:hidden" width="100%" src="figures/bench_5.png"> </picture> </p>

Effective tool calling and search are prerequisites for a model's ability to autonomously handle more complex tasks. In evaluations on benchmarks such as BrowseComp and Wide Search, M2.5 achieved industry-leading performance. At the same time, the model's generalization has also improved — M2.5 demonstrates more stable performance when facing unfamiliar scaffolding environments.

In research tasks performed by professional human experts, using a search engine is only a small part of the process; most of the work involves deep exploration across information-dense webpages. To address this, we built RISE (Realistic Interactive Search Evaluation) to measure a model's search capabilities on real-world professional tasks. The results show that M2.5 excels at expert-level search tasks in real-world settings.

Compared to its predecessors, M2.5 also demonstrates much better decision-making when handling agentic tasks: it has learned to solve problems with more precise search rounds and better token efficiency. For example, across multiple agentic tasks including BrowseComp, Wide Search, and RISE, M2.5 achieved better results with fewer rounds, using approximately 20% fewer rounds compared to M2.1. This indicates that the model is no longer just getting the answer right, but is also reasoning towards results in more efficient paths.

Office work

M2.5 was trained to produce truly deliverable outputs in office scenarios. To this end, we engaged in thorough collaboration with senior professionals in fields such as finance, law, and social sciences. They designed requirements, provided feedback, participated in defining standards, and directly contributed to data construction, bringing the tacit knowledge of their industries into the model's training pipeline. Based on this foundation, M2.5 has achieved significant capability improvements in high-value workspace scenarios such as Word, PowerPoint, and Excel financial modeling.

On the evaluation side, we built an internal Cowork Agent evaluation framework (GDPval-MM) that assesses both the quality of the deliverable and the professionalism of the agent's trajectory through pairwise comparisons, while also monitoring token costs across the entire workflow to estimate the model's real-world productivity gains. In comparisons against other mainstream models, it achieved an average win rate of 59.0%.

<p align="center"> <picture> <img class="hidden dark:block" width="100%" src="figures/bench_8.png"> <img class="dark:hidden" width="100%" src="figures/bench_7.png"> </picture> </p>

Efficiency

Because the real world is full of deadlines and time constraints, task completion speed is a practical necessity. The time it takes a model to complete a task depends on its task decomposition effectiveness, token efficiency, and inference speed. M2.5 is served natively at a rate of 100 tokens per second, which is nearly twice that of other frontier models. Further, our reinforcement learning setup incentivizes the model to reason efficiently and break down tasks optimally. Due to these three factors, M2.5 delivers a significant time savings in complex task completion.

For example, when running SWE-Bench Verified, M2.5 consumed an average of 3.52 million tokens per task. In comparison, M2.1 consumed 3.72M tokens. Meanwhile, thanks to improvements in capabilities such as parallel tool calling, the end-to-end runtime decreased from an average of 31.3 minutes to 22.8 minutes, representing a 37% speed improvement. This runtime is on par with Claude Opus 4.6's 22.9 minutes, while the total cost per task is only 10% that of Claude Opus 4.6.

Cost

Our goal in designing the M2-series of foundation models is to power complex agents without having to worry about cost. We believe that M2.5 is close to realizing this goal. We’re releasing two versions of the model, M2.5 and M2.5-Lightning, that are identical in capability but differ in speed. M2.5-Lightning has a steady throughput of 100 tokens per second, which is two times faster than other frontier models, and costs $0.3 per million input tokens and $2.4 per million output tokens. M2.5, which has a throughput of 50 tokens per second, costs half that. Both model versions support caching. Based on output price, the cost of M2.5 is one-tenth to one-twentieth that of Opus, Gemini 3 Pro, and GPT-5.

At a rate of 100 output tokens per second, running M2.5 continuously for an hour costs $1. At a rate of 50 TPS, the price drops to $0.3. To put that into perspective, you can have four M2.5 instances running continuously for an entire year for $10,000. We believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy. For the M2-series, the only problem that remains is how to continually push the frontier of model capability.
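
As a rough sanity check of the hourly figure (our own arithmetic, not part of the announcement), the 100 tokens-per-second tier priced at $2.4 per million output tokens works out as follows; input-token and caching costs are ignored:

tokens_per_hour = 100 * 3600                # 100 output tokens/s for one hour = 360,000 tokens
output_cost = tokens_per_hour / 1e6 * 2.4   # $2.4 per million output tokens (M2.5-Lightning)
print(f"~${output_cost:.2f} per hour in output tokens")  # ~$0.86, on the order of the quoted $1/hour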

Improvement Rate

Over the three and a half months from late October to now, we have successively released M2, M2.1, and M2.5, with the pace of model improvement exceeding our original expectations. For instance, in the highly-regarded SWE-Bench Verified benchmark, the rate of progress of the M2-series has been significantly faster than that of peers such as the Claude, GPT, and Gemini model families.

<p align="center"> <img width="100%" src="figures/bench_10.png"> </p>

RL Scaling

One of the key drivers of the developments described above is the scaling of reinforcement learning. As we train our models, we also benefit from their abilities: most of the tasks and workflows carried out inside our company have been turned into RL training environments, and to date there are already hundreds of thousands of them. At the same time, we did substantial work on our agentic RL framework, algorithms, reward signals, and infrastructure engineering to support the continued scaling of our RL training.

Forge –– Agent-Native RL Framework

We designed an agent-native RL framework in-house, called Forge, which introduces an intermediary layer that fully decouples the underlying training-inference engine from the agent, supporting the integration of arbitrary agents and enabling us to optimize the model's generalization across agent scaffolds and tools. To improve system throughput, we optimized asynchronous scheduling strategies to balance system throughput against sample off-policyness, and designed a tree-structured merging strategy for training samples, achieving approximately 40x training speedup.

<p align="center"> <img width="60%" src="figures/rl_1.png"> </p>

Agentic RL Algorithm and Reward Design

On the algorithm side, we continued using the CISPO algorithm we proposed at the beginning of last year to ensure the stability of MoE models during large-scale training. To address the credit assignment challenge posed by long contexts in agent rollouts, we introduced a process reward mechanism for end-to-end monitoring of generation quality. Furthermore, to deeply align with user experience, we evaluated task completion time through agent trajectories, achieving an optimal trade-off between model intelligence and response speed.

<p align="center"> <img width="60%" src="figures/rl_2.png"> </p>

We will release a more comprehensive introduction to RL scaling soon in a separate technical blogpost.

MiniMax Agent: M2.5 as a Professional Employee

M2.5 has been fully deployed in MiniMax Agent, delivering the best agentic experience.

We have distilled core information-processing capabilities into standardized Office Skills deeply integrated within MiniMax Agent. In MAX mode, when handling tasks such as Word formatting, PowerPoint editing, and Excel calculations, MiniMax Agent automatically loads the corresponding Office Skills based on file type, improving the quality of task outputs.

Furthermore, users can combine Office Skills with domain-specific industry expertise to create reusable Experts tailored to specific task scenarios.

Take industry research as an example: by merging a mature research framework SOP (standard operating procedure) with Word Skills, the Agent can strictly follow the established framework to automatically fetch data, organize analytical logic, and output properly formatted research reports — rather than merely generating a raw block of text. In financial modeling scenarios, by combining an organization's proprietary modeling standards with Excel Skills, the Agent can follow specific risk control logic and calculation standards to automatically generate and validate complex financial models, rather than simply outputting a basic spreadsheet.

To date, users have built over 10,000 Experts on MiniMax Agent, and this number is still growing rapidly. MiniMax has also built multiple sets of deeply optimized, ready-to-use Expert suites on MiniMax Agent for high-frequency scenarios such as office work, finance, and programming.

MiniMax itself has been among the first to benefit from M2.5's capabilities. Throughout the company's daily operations, 30% of overall tasks are autonomously completed by M2.5, spanning functions including R&D, product, sales, HR, and finance — and the penetration rate continues to rise. Performance in coding scenarios has been particularly notable, with M2.5-generated code accounting for 80% of newly committed code.

How to Use

MiniMax Agent: https://agent.minimax.io/

MiniMax API Platform: https://platform.minimax.io/

MiniMax Coding Plan: https://platform.minimax.io/subscribe/coding-plan

Local Deployment Guide

Download the model from HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2.5

We recommend using the following inference frameworks (listed alphabetically) to serve the model:

SGLang

We recommend using SGLang to serve MiniMax-M2.5. Please refer to our SGLang Deployment Guide.

vLLM

We recommend using vLLM to serve MiniMax-M2.5. Please refer to our vLLM Deployment Guide.

Transformers

We recommend using Transformers to serve MiniMax-M2.5. Please refer to our Transformers Deployment Guide.

KTransformers

We recommend using KTransformers to serve MiniMax-M2.5. Please refer to KTransformers Deployment Guide

Inference Parameters

We recommend using the following parameters for best performance: temperature=1.0, top_p=0.95, top_k=40. Default system prompt:

You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.

Tool Calling Guide

Please refer to our Tool Calling Guide.

Contact Us

Contact us at model@minimax.io.

Appendix

Further benchmark results of M2.5:

| Benchmark | MiniMax-M2.5 | MiniMax-M2.1 | Claude Sonnet 4.5 | Claude Opus 4.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 (thinking) |
|---|---|---|---|---|---|---|---|
| AIME25 | 86.3 | 83.0 | 88.0 | 91.0 | 95.6 | 96.0 | 98.0 |
| GPQA-D | 85.2 | 83.0 | 83.0 | 87.0 | 90.0 | 91.0 | 90.0 |
| HLE w/o tools | 19.4 | 22.2 | 17.3 | 28.4 | 30.7 | 37.2 | 31.4 |
| SciCode | 44.4 | 41.0 | 45.0 | 50.0 | 52.0 | 56.0 | 52.0 |
| IFBench | 70.0 | 70.0 | 57.0 | 58.0 | 53.0 | 70.0 | 75.0 |
| AA-LCR | 69.5 | 62.0 | 66.0 | 74.0 | 71.0 | 71.0 | 73.0 |

Evaluation methods:

  • SWE benchmark: SWE-bench Verified, SWE-bench Multilingual, SWE-bench-pro, and Multi-SWE-bench were tested on internal infrastructure using Claude Code as the scaffolding, with the default system prompt overridden, and results averaged over 4 runs. Additionally, SWE-bench Verified was also evaluated on the Droid and Opencode scaffoldings using the default prompt.
  • Terminal Bench 2: We tested Terminal Bench 2 using Claude Code 2.0.64 as the evaluation scaffolding. We modified the Dockerfiles of some problems to ensure the correctness of the problems themselves, uniformly expanded sandbox specifications to 8-core CPU and 16 GB memory, set the timeout uniformly to 7,200 seconds, and equipped each problem with a basic toolset (ps, curl, git, etc.). While not retrying on timeouts, we added a detection mechanism for empty scaffolding responses, retrying tasks whose final response was empty to handle various abnormal interruption scenarios. Final results are averaged over 4 runs.
  • VIBE-Pro: Internal benchmark. Uses Claude Code as the scaffolding to automatically verify the interaction logic and visual effects of programs. All scores are computed through a unified pipeline that includes a requirements set, containerized deployment, and a dynamic interaction environment. Final results are averaged over 3 runs.
  • BrowseComp: Uses the same agent framework as WebExplorer (Liu et al., 2025). When token usage exceeds 30% of the maximum context, all history is discarded.
  • Wide Search: Uses the same agent framework as WebExplorer (Liu et al., 2025).
  • RISE: Internal benchmark. Contains real questions from human experts, evaluating the model's multi-step information retrieval and reasoning capabilities when combined with complex web interactions. A Playwright-based browser tool suite is added on top of the WebExplorer (Liu et al., 2025) agent framework.
  • GDPval-MM: Internal benchmark. Based on the open-source GDPval test set, using a custom agentic evaluation framework where an LLM-as-a-judge performs pairwise win/tie/loss judgments on complete trajectories. Average token cost per task is calculated based on each vendor's official API pricing (without caching).
  • MEWC: Internal benchmark. Built on MEWC (Microsoft Excel World Championship), comprising 179 problems from the main and other regional divisions of Excel esports competitions from 2021–2026. It evaluates the model's ability to understand competition Excel spreadsheets and use Excel tools to complete problems. Scores are calculated by comparing output and answer cell values one by one.
  • Finance Modeling: Internal benchmark. Primarily contains financial modeling problems constructed by industry experts, involving end-to-end research and analysis tasks performed via Excel tools. Each problem is scored using expert-designed rubrics. Final results are averaged over 3 runs.
  • AIME25 ~ AA-LCR: Obtained through internal testing based on the public evaluation sets and evaluation methods covered by the Artificial Analysis Intelligence Index leaderboard.

Author: QuantTrio

Likes: 2

Downloads: 0

Tags: transformers, safetensors, minimax_m2, text-generation, vLLM, AWQ, conversational, custom_code, license:other, endpoints_compatible, 4-bit, awq, region:us

freddm/JoyAI-LLM-Flash-GGUF


language:

  • zh
  • en pipeline_tag: text-generation

<div align="center"> <picture> <img src="figures/joyai-logo.png" width="30%" alt="JoyAI-LLM Flash"> </picture> </div> <hr> <div align="center" style="line-height: 1;"> <a href="https://huggingface.co/jdopensource" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-JD-ffc107?color=ffc107&logoColor=white"/></a> <a href="https://huggingface.co/jdopensource/JoyAI-LLM-Flash/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a> </div>

1. Model Introduction

JoyAI-LLM Flash is a state-of-the-art medium-sized instruct language model with 3 billion activated parameters and 48 billion total parameters. It was pretrained on 20 trillion text tokens using the Muon optimizer, followed by large-scale supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) across diverse environments. JoyAI-LLM Flash achieves strong performance across frontier knowledge, reasoning, coding, and agentic capabilities.

Key Features

  • Fiber Bundle RL: Introduces fiber bundle theory into reinforcement learning, proposing a novel optimization framework, FiberPO. This method is specifically designed to handle the challenges of large-scale and heterogeneous agent training, improving stability and robustness under complex data distributions.
  • Training-Inference Collaboration: applies the Muon optimizer with dense MTP and develops novel optimization techniques to resolve instabilities while scaling up, delivering 1.3× to 1.7× the throughput of the non-MTP version.
  • Agentic Intelligence: designed for tool use, reasoning, and autonomous problem-solving.

2. Model Summary

| | |
| :--- | :--- |
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 48B |
| Activated Parameters | 3B |
| Number of Layers (Dense layer included) | 40 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 2048 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 32 |
| Number of Experts | 256 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 129K |
| Context Length | 128K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

3. Evaluation Results

| Benchmark | JoyAI-LLM Flash | Qwen3-30B-A3B-Instruct-2507 | GLM-4.7-Flash (Non-thinking) |
|---|---|---|---|
| **Knowledge & Alignment** | | | |
| MMLU | **89.50** | 86.87 | 80.53 |
| MMLU-Pro | **81.02** | 73.88 | 63.62 |
| CMMLU | **87.03** | 85.88 | 75.85 |
| GPQA-Diamond | **74.43** | 68.69 | 39.90 |
| SuperGPQA | **55.00** | 52.00 | 32.00 |
| LiveBench | **72.90** | 59.70 | 43.10 |
| IFEval | **86.69** | 83.18 | 82.44 |
| AlignBench | **8.24** | 8.07 | 6.85 |
| HellaSwag | **91.79** | 89.90 | 60.84 |
| **Coding** | | | |
| HumanEval | **96.34** | 95.12 | 74.39 |
| LiveCodeBench | **65.60** | 39.71 | 27.43 |
| SciCode | **3.08/22.92** | **3.08/22.92** | 3.08/15.11 |
| **Mathematics** | | | |
| GSM8K | **95.83** | 79.83 | 81.88 |
| AIME2025 | **65.83** | 62.08 | 24.17 |
| MATH 500 | **97.10** | 89.80 | 90.90 |
| **Agentic** | | | |
| SWE-bench Verified | **60.60** | 24.44 | 51.60 |
| Tau2-Retail | **67.55** | 53.51 | 62.28 |
| Tau2-Airline | **54.00** | 32.00 | 52.00 |
| Tau2-Telecom | 79.83 | 4.39 | **88.60** |
| **Long Context** | | | |
| RULER | **95.60** | 89.66 | 56.12 |

4. Deployment

[!Note] You can access the JoyAI-LLM Flash API at https://docs.jdcloud.com/cn/jdaip/chat; an OpenAI/Anthropic-compatible API is provided. Currently, JoyAI-LLM Flash is recommended to run on the following inference engines:

  • vLLM
  • SGLang

The minimum version requirement for transformers is 4.57.1.

Deployment examples can be found in the Model Deployment Guide.

5. Model Usage

The demos below show how to call our official API.

For third-party APIs deployed with vLLM or SGLang, please note that:

[!Note] Recommended sampling parameters: temperature=0.6, top_p=1.0

Chat Completion

This is a simple chat-completion script that shows how to call the JoyAI-LLM Flash API.

from openai import OpenAI

client = OpenAI(base_url="http://IP:PORT/v1", api_key="EMPTY")


def simple_chat(client: OpenAI):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "which one is bigger, 9.11 or 9.9? think carefully.",
                }
            ],
        },
    ]
    model_name = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=False, max_tokens=4096
    )
    print(f"response: {response.choices[0].message.content}")


if __name__ == "__main__":
    simple_chat(client)

Tool call Completion

This is a simple tool-call completion script that shows how to call the JoyAI-LLM Flash API.

import json

from openai import OpenAI

client = OpenAI(base_url="http://IP:PORT/v1", api_key="EMPTY")


def my_calculator(expression: str) -> str:
    # Demo only: eval() on untrusted input is unsafe outside of examples.
    return str(eval(expression))


def rewrite(text: str) -> str:
    # Placeholder implementation matching the "rewrite" tool schema declared below.
    return str(text)


def simple_tool_call(client: OpenAI):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "use my functions to compute the results for the equations: 6+1",
                },
            ],
        },
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "my_calculator",
                "description": "A calculator that can evaluate a mathematical equation and compute its results.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "The mathematical expression to evaluate.",
                        },
                    },
                    "required": ["expression"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "rewrite",
                "description": "Rewrite a given text for improved clarity",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "text": {
                            "type": "string",
                            "description": "The input text to rewrite",
                        }
                    },
                },
            },
        },
    ]
    model_name = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=1.0,
        max_tokens=1024,
        tools=tools,
        tool_choice="auto",
    )
    tool_calls = response.choices[0].message.tool_calls

    results = []
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        function_args = tool_call.function.arguments
        if function_name == "my_calculator":
            result = my_calculator(**json.loads(function_args))
            results.append(result)
    messages.append({"role": "assistant", "tool_calls": tool_calls})
    for tool_call, result in zip(tool_calls, results):
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": tool_call.function.name,
                "content": result,
            }
        )
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=1.0,
        max_tokens=1024,
    )
    print(response.choices[0].message.content)


if __name__ == "__main__":
    simple_tool_call(client)


6. License

Both the code repository and the model weights are released under the Modified MIT License.

Author: freddm

Likes: 2

Downloads: 0

Tags: gguf, text-generation, zh, en, endpoints_compatible, region:us, conversational