Today's AI Summary

AI Developments: Reasoning Models, Domain Adaptation, and Safety-Oriented AI

Here's a look at some of the latest advancements in AI, covering reasoning models, domain adaptation techniques, and safety-focused AI development.

Research Highlights

  • SIDA: Synthetic Image Driven Zero-shot Domain Adaptation: This paper introduces a novel zero-shot domain adaptation method using synthetic images to address the limitations of text-driven approaches. SIDA leverages image translation to capture fine-grained style cues, achieving state-of-the-art performance and improved efficiency in adaptation time.
  • Scenethesis: 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation: This research presents a new approach to 3D software synthesis, utilizing a domain-specific language called ScenethesisLang. This language serves as a granular, constraint-aware intermediate representation, enabling fine-grained modification of 3D software elements and systematic constraint satisfaction.
  • Moving Out: Physically-grounded Human-AI Collaboration: This paper introduces a new benchmark for human-AI collaboration in physical environments. It also proposes a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions.
  • SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning: This paper introduces SynC, a novel framework specifically designed to refine synthetic image-caption datasets for zero-shot image captioning (ZIC). Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool.
  • SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law: This paper introduces SafeWork-R1, a multimodal reasoning model developed using the SafeLadder framework. SafeWork-R1 demonstrates the coevolution of capabilities and safety, achieving significant improvements on safety-related benchmarks without compromising general capabilities.

Model Updates

  • Llama-3.3-Nemotron-Super-49B-v1.5 GGUF Models: Mungert and bartowski have released GGUF versions of the Llama-3.3-Nemotron-Super-49B-v1.5 model. This model is a derivative of Meta's Llama-3.3-70B-Instruct, enhanced for reasoning, chat preferences, and agentic tasks. It supports a context length of 128K tokens and is designed for commercial use. The model utilizes Neural Architecture Search (NAS) for optimized efficiency and accuracy.
  • Qwen3-235B-A22B-Instruct-2507-3bit-DWQ: The mlx-community has released a 3-bit quantized version of the Qwen3-235B-A22B-Instruct-2507 model in MLX format, converted from the 8-bit version.
  • EXL3 Quants of nvidia/Llama-3_3-Nemotron-Super-49B-v1_5: ArtusDev has released EXL3 quants of the Llama-3.3-Nemotron-Super-49B-v1_5 model, providing various quantization levels for different performance and memory requirements.

Key Takeaways

  • Efficient Reasoning Models: The Llama-3.3-Nemotron-Super-49B-v1.5 model showcases advancements in creating efficient reasoning models through Neural Architecture Search, balancing accuracy and computational cost.
  • Synthetic Data for Adaptation: The SIDA paper highlights the potential of using synthetic images to improve zero-shot domain adaptation, offering a more efficient alternative to text-driven methods.
  • Safety-Oriented AI: The SafeWork-R1 model demonstrates the feasibility of coevolving safety and intelligence in AI systems, achieving state-of-the-art safety performance without sacrificing general capabilities.
  • Quantization Techniques: The releases of GGUF, MLX, and EXL3 quantized models emphasize the ongoing efforts to optimize large language models for broader accessibility and efficient deployment on various hardware platforms.

AI Papers for 2026-04-25

Seeing Fast and Slow: Learning the Flow of Time in Videos

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and the Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio between the low-resolution sequence and the high-resolution sequence), limiting transfer across spatial resolutions and temporal cadences (frame rates). We present a scale-adaptive framework that reuses the same architecture across factors by decomposing spatiotemporal SR into a deterministic prediction of the conditional mean, with attention, and a residual conditional diffusion model, with an optional mass-conservation (same precipitation amount in inputs and outputs) transform to preserve aggregated totals. Assuming that larger SR factors primarily increase underdetermination (hence required context and residual uncertainty) rather than changing the conditional-mean structure, scale adaptivity is achieved by retuning three factor-dependent hyperparameters before retraining: the diffusion noise schedule amplitude beta (larger for larger factors to increase diversity), the temporal context length L (set to maintain comparable attention horizons across cadences) and optionally a third, the mass-conservation function f (tapered to limit the amplification of extremes for large factors). Demonstrated on reanalysis precipitation over France (Comephore), the same architecture spans super-resolution factors from 1 to 25 in space and 1 to 6 in time, yielding a reusable architecture and tuning recipe for joint spatiotemporal super-resolution across scales.

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ($8\times$).

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., Shannon, 1950) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Maintaining instantaneous balance between electricity supply and demand is critical for reliability and grid stability. System operators achieve this by solving the Unit Commitment (UC) task, a high-dimensional, large-scale Mixed-Integer Linear Programming (MILP) problem that is strictly and heavily governed by the grid's physical constraints. As grids integrate variable renewable sources and new technologies such as long-duration storage, UC must be optimally solved for multi-day horizons and potentially with greater frequency. Therefore, traditional MILP solvers increasingly struggle to compute solutions within these tightening operational time limits. To bypass these computational bottlenecks, this paper proposes a novel framework utilizing a transformer-based architecture to predict generator commitment schedules over a 72-hour horizon. Also, because raw predictions in high-dimensional spaces often yield physically infeasible results, the pipeline integrates the self-attention network with deterministic post-processing heuristics that systematically enforce minimum up/down times and minimize excess capacity. Finally, these refined predictions are utilized as a warm start for a downstream MILP solver, while employing a confidence-based variable fixation strategy to drastically reduce the combinatorial search space. Validated on a single-bus test system, the complete multi-stage pipeline achieves 100% feasibility and significantly accelerates computation times. Notably, in approximately 20% of test instances, the proposed model reached a feasible operational schedule with a lower overall system cost than relying solely on the solver.

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types, and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose MODEE, a novel multimodal approach for open-domain event extraction that combines graph-based learning with text-based representations from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

Addressing Image Authenticity When Cameras Use Generative AI

The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the 'unhallucinated' version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.

AI Models

unsloth/DeepSeek-V4-Flash


license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-V4-Flash

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Homepage: https://www.deepseek.com/ · Chat: https://chat.deepseek.com/ · Hugging Face: https://huggingface.co/deepseek-ai · Twitter: https://twitter.com/deepseek_ai · Technical Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

Introduction

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

  1. Hybrid Attention Architecture: We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2.
  2. Manifold-Constrained Hyper-Connections (mHC): We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
  3. Muon Optimizer: We employ the Muon optimizer for faster convergence and greater training stability.

We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline. The post-training features a two-stage paradigm: independent cultivation of domain-specific experts (through SFT and RL with GRPO), followed by unified model consolidation via on-policy distillation, integrating distinct proficiencies across diverse domains into a single model.

DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks. Meanwhile, DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows.

<div align="center"> <img src="assets/dsv4_performance.png" > </div>

Model Downloads

<div align="center">

| Model | #Total Params | #Activated Params | Context Length | Precision | Download | | :---: | :---: | :---: | :---: | :---: | :---: | | DeepSeek-V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | HuggingFace | ModelScope | | DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 Mixed* | HuggingFace | ModelScope | | DeepSeek-V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | HuggingFace | ModelScope | | DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 Mixed* | HuggingFace | ModelScope |

</div>

*FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.

Evaluation Results

Base Model

<div align="center">

| Benchmark (Metric) | # Shots | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base | | :--- | :---: | :---: | :---: | :---: | | Architecture | - | MoE | MoE | MoE | | # Activated Params | - | 37B | 13B | 49B | | # Total Params | - | 671B | 284B | 1.6T | | World Knowledge | | | | | | AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 | | MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 | | MMLU-Redux (EM) | 5-shot | 87.5 | 89.4 | 90.8 | | MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 | | MMMLU (EM) | 5-shot | 87.9 | 88.8 | 90.3 | | C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 | | CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 | | MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 | | Simple-QA verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 | | SuperGPQA (EM) | 5-shot | 45.0 | 46.5 | 53.9 | | FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 | | TriviaQA (EM) | 5-shot | 83.3 | 82.8 | 85.6 | | Language & Reasoning | | | | | | BBH (EM) | 3-shot | 87.6 | 86.9 | 87.5 | | DROP (F1) | 1-shot | 88.2 | 88.6 | 88.7 | | HellaSwag (EM) | 0-shot | 86.4 | 85.7 | 88.0 | | WinoGrande (EM) | 0-shot | 78.9 | 79.5 | 81.5 | | CLUEWSC (EM) | 5-shot | 83.5 | 82.2 | 85.2 | | Code & Math | | | | | | BigCodeBench (Pass@1) | 3-shot | 63.9 | 56.8 | 59.2 | | HumanEval (Pass@1) | 0-shot | 62.8 | 69.5 | 76.8 | | GSM8K (EM) | 8-shot | 91.1 | 90.8 | 92.6 | | MATH (EM) | 4-shot | 60.5 | 57.4 | 64.5 | | MGSM (EM) | 8-shot | 81.3 | 85.7 | 84.4 | | CMath (EM) | 3-shot | 92.6 | 93.6 | 90.9 | | Long Context | | | | | | LongBench-V2 (EM) | 1-shot | 40.2 | 44.7 | 51.5 |

</div>

Instruct Model

DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning effort modes:

| Reasoning Mode | Characteristics | Typical Use Cases | Response Format |
| :--- | :--- | :--- | :--- |
| Non-think | Fast, intuitive responses | Routine daily tasks, low-risk decisions | `</think> summary` |
| Think High | Conscious logical analysis, slower but more accurate | Complex problem-solving, planning | `<think> thinking </think> summary` |
| Think Max | Push reasoning to its fullest extent | Exploring the boundary of model reasoning capability | Special system prompt + `<think> thinking </think> summary` |
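
For readers handling the raw text output, a minimal parsing sketch for these response formats (the helper below is illustrative only; the encoding folder in the repository is the authoritative reference for parsing):

import re

def split_thinking(completion: str):
    # Split a completion of the form "<think> thinking </think> summary" into
    # (thinking, summary); Non-think outputs carry no <think> block and may
    # start with a bare "</think>" marker, which is simply stripped.
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return None, completion.split("</think>", 1)[-1].strip()
    return match.group(1).strip(), completion[match.end():].strip()

print(split_thinking("<think> 1+1 is basic arithmetic. </think> 1+1 = 2."))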

DeepSeek-V4-Pro-Max vs Frontier Models

<div align="center">

| Benchmark (Metric) | Opus-4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | K2.6 Thinking | GLM-5.1 Thinking | DS-V4-Pro Max | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | | Knowledge & Reasoning | | | | | | | | MMLU-Pro (EM) | 89.1 | 87.5 | 91.0 | 87.1 | 86.0 | 87.5 | | SimpleQA-Verified (Pass@1) | 46.2 | 45.3 | 75.6 | 36.9 | 38.1 | 57.9 | | Chinese-SimpleQA (Pass@1) | 76.4 | 76.8 | 85.9 | 75.9 | 75.0 | 84.4 | | GPQA Diamond (Pass@1) | 91.3 | 93.0 | 94.3 | 90.5 | 86.2 | 90.1 | | HLE (Pass@1) | 40.0 | 39.8 | 44.4 | 36.4 | 34.7 | 37.7 | | LiveCodeBench (Pass@1) | 88.8 | - | 91.7 | 89.6 | - | 93.5 | | Codeforces (Rating) | - | 3168 | 3052 | - | - | 3206 | | HMMT 2026 Feb (Pass@1) | 96.2 | 97.7 | 94.7 | 92.7 | 89.4 | 95.2 | | IMOAnswerBench (Pass@1) | 75.3 | 91.4 | 81.0 | 86.0 | 83.8 | 89.8 | | Apex (Pass@1) | 34.5 | 54.1 | 60.9 | 24.0 | 11.5 | 38.3 | | Apex Shortlist (Pass@1) | 85.9 | 78.1 | 89.1 | 75.5 | 72.4 | 90.2 | | Long Context | | | | | | | | MRCR 1M (MMR) | 92.9 | - | 76.3 | - | - | 83.5 | | CorpusQA 1M (ACC) | 71.7 | - | 53.8 | - | - | 62.0 | | Agentic | | | | | | | | Terminal Bench 2.0 (Acc) | 65.4 | 75.1 | 68.5 | 66.7 | 63.5 | 67.9 | | SWE Verified (Resolved) | 80.8 | - | 80.6 | 80.2 | - | 80.6 | | SWE Pro (Resolved) | 57.3 | 57.7 | 54.2 | 58.6 | 58.4 | 55.4 | | SWE Multilingual (Resolved) | 77.5 | - | - | 76.7 | 73.3 | 76.2 | | BrowseComp (Pass@1) | 83.7 | 82.7 | 85.9 | 83.2 | 79.3 | 83.4 | | HLE w/ tools (Pass@1) | 53.1 | 52.0 | 51.6 | 54.0 | 50.4 | 48.2 | | GDPval-AA (Elo) | 1619 | 1674 | 1314 | 1482 | 1535 | 1554 | | MCPAtlas Public (Pass@1) | 73.8 | 67.2 | 69.2 | 66.6 | 71.8 | 73.6 | | Toolathlon (Pass@1) | 47.2 | 54.6 | 48.8 | 50.0 | 40.7 | 51.8 |

</div>

Comparison across Modes

<div align="center">

| Benchmark (Metric) | V4-Flash Non-Think | V4-Flash High | V4-Flash Max | V4-Pro Non-Think | V4-Pro High | V4-Pro Max | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | | Knowledge & Reasoning | | | | | | | | MMLU-Pro (EM) | 83.0 | 86.4 | 86.2 | 82.9 | 87.1 | 87.5 | | SimpleQA-Verified (Pass@1) | 23.1 | 28.9 | 34.1 | 45.0 | 46.2 | 57.9 | | Chinese-SimpleQA (Pass@1) | 71.5 | 73.2 | 78.9 | 75.8 | 77.7 | 84.4 | | GPQA Diamond (Pass@1) | 71.2 | 87.4 | 88.1 | 72.9 | 89.1 | 90.1 | | HLE (Pass@1) | 8.1 | 29.4 | 34.8 | 7.7 | 34.5 | 37.7 | | LiveCodeBench (Pass@1) | 55.2 | 88.4 | 91.6 | 56.8 | 89.8 | 93.5 | | Codeforces (Rating) | - | 2816 | 3052 | - | 2919 | 3206 | | HMMT 2026 Feb (Pass@1) | 40.8 | 91.9 | 94.8 | 31.7 | 94.0 | 95.2 | | IMOAnswerBench (Pass@1) | 41.9 | 85.1 | 88.4 | 35.3 | 88.0 | 89.8 | | Apex (Pass@1) | 1.0 | 19.1 | 33.0 | 0.4 | 27.4 | 38.3 | | Apex Shortlist (Pass@1) | 9.3 | 72.1 | 85.7 | 9.2 | 85.5 | 90.2 | | Long Context | | | | | | | | MRCR 1M (MMR) | 37.5 | 76.9 | 78.7 | 44.7 | 83.3 | 83.5 | | CorpusQA 1M (ACC) | 15.5 | 59.3 | 60.5 | 35.6 | 56.5 | 62.0 | | Agentic | | | | | | | | Terminal Bench 2.0 (Acc) | 49.1 | 56.6 | 56.9 | 59.1 | 63.3 | 67.9 | | SWE Verified (Resolved) | 73.7 | 78.6 | 79.0 | 73.6 | 79.4 | 80.6 | | SWE Pro (Resolved) | 49.1 | 52.3 | 52.6 | 52.1 | 54.4 | 55.4 | | SWE Multilingual (Resolved) | 69.7 | 70.2 | 73.3 | 69.8 | 74.1 | 76.2 | | BrowseComp (Pass@1) | - | 53.5 | 73.2 | - | 80.4 | 83.4 | | HLE w/ tools (Pass@1) | - | 40.3 | 45.1 | - | 44.7 | 48.2 | | MCPAtlas (Pass@1) | 64.0 | 67.4 | 69.0 | 69.4 | 74.2 | 73.6 | | GDPval-AA (Elo) | - | - | 1395 | - | - | 1554 | | Toolathlon (Pass@1) | 40.7 | 43.5 | 47.8 | 46.3 | 49.0 | 51.8 |

</div>

Chat Template

This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding folder for full documentation.

A brief example:

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)

How to Run Locally

Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
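
As a minimal sketch of those recommendations in a plain transformers generation call (the model identifier, prompt handling, and max_new_tokens below are placeholders; the inference folder remains the reference for the supported setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"  # placeholder; use the checkpoint you deploy
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# In practice the prompt should be built with encode_messages() as shown above.
prompt = "User: 1+1=?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,   # recommended sampling temperature
    top_p=1.0,         # recommended top_p
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))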

License

This repository and the model weights are licensed under the MIT License.

Citation

@misc{deepseekai2026deepseekv4,
      title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
      author={DeepSeek-AI},
      year={2026},
}

Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.

Author: unsloth

Likes: 23

Downloads: 0

Tags: transformers, safetensors, deepseek_v4, text-generation, base_model:deepseek-ai/DeepSeek-V4-Flash, base_model:quantized:deepseek-ai/DeepSeek-V4-Flash, license:mit, endpoints_compatible, 8-bit, fp8, region:us

unsloth/DeepSeek-V4-Pro


license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-V4-Pro

The accompanying model card is the shared DeepSeek-V4 README, reproduced in full under the unsloth/DeepSeek-V4-Flash entry above; it applies to DeepSeek-V4-Pro as well.

Author: unsloth

Likes: 20

Downloads: 0

Tags: transformers, safetensors, deepseek_v4, text-generation, base_model:deepseek-ai/DeepSeek-V4-Pro, base_model:quantized:deepseek-ai/DeepSeek-V4-Pro, license:mit, endpoints_compatible, 8-bit, fp8, region:us

FINAL-Bench/Darwin-9B-NEG


license: apache-2.0
base_model:
  - FINAL-Bench/Darwin-9B-Opus
tags:
  - darwin
  - darwin-v8
  - darwin-neg
  - native-entropy-gating
  - NEG
  - reasoning
  - self-regulated-reasoning
  - advanced-reasoning
  - thinking
  - qwen3.5
  - qwen
  - gpqa
  - benchmark
  - open-source
  - apache-2.0
  - hybrid-vigor
  - proto-agi
  - vidraft
  - eval-results
language:
  - en
  - zh
  - ko
  - ja
  - multilingual
pipeline_tag: text-generation
library_name: transformers
model-index:
  - name: Darwin-9B-NEG
    results:
      - task:
          type: text-generation
          name: Graduate-Level Reasoning
        dataset:
          type: Idavidrein/gpqa
          name: GPQA Diamond
          config: gpqa_diamond
          split: train
        metrics:
          - type: accuracy
            value: 84.34
            name: Accuracy
            verified: false

Darwin-9B-NEG: The First Native Entropy Gating Model

<p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-NEG"><img src="https://img.shields.io/badge/โญ_GPQA_Diamond-84.34%25_Darwin--9B--NEG-gold?style=for-the-badge" alt="GPQA"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Base-Darwin--9B--Opus-blue?style=for-the-badge" alt="Base"></a> </p> <p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--4B--Genesis-blue?style=for-the-badge" alt="Genesis"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--27B--Opus-blue?style=for-the-badge" alt="27B"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--36B--Opus-blue?style=for-the-badge" alt="36B"></a> </p> <p align="center"> <a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/๐Ÿ _Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a> <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/๐Ÿ†_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a> </p>

Qwen3.5-9B backbone · 8.95B parameters · BF16 · Thinking Mode · Apache 2.0. The first NEG-enabled model: self-regulating reasoning with no extra library.


Abstract

Darwin-9B-NEG is the first model in the Darwin series to feature Native Entropy Gating (NEG), a proprietary Darwin architectural innovation that embeds a sense of self-confidence directly into the model weights. Unlike external multi-turn iteration (MTI) techniques that require 3× to 8× extra inference, NEG operates inside the single decoding loop and activates in fewer than 5% of generation steps, lifting reasoning accuracy by more than 12 percentage points at 1× inference cost.

On the GPQA Diamond PhD-level reasoning benchmark (198 questions), Darwin-9B-NEG scores 84.34% with the full 3-stage ensemble protocol, surpassing even the published Qwen3.5-9B leaderboard result (81.7%).


What Makes Darwin-9B-NEG Different

🧬 Darwin Series: Evolutionary Model Merging

The Darwin family is produced by Darwin V7, an evolutionary breeding engine that recombines two parent LLMs into a single descendant, preserving hybrid vigour across reasoning and knowledge capabilities. Darwin-9B-Opus, this model's base, is the Qwen3.5-family member of the Darwin series, previously published as a stand-alone reasoning model.

⚡ NEG: Native Entropy Gating (Darwin V8)

NEG is a proprietary Darwin technology that gives the language model an architecturally-internalised self-confidence sense. Two tiny learnable modules ride alongside the transformer:

  • NEG-Head (≈4 M params, ~0.05% of total weights) predicts, at each step, the entropy of the next-token distribution from the last hidden state.
  • NEG-Gate (1 learnable threshold) decides, on a per-token basis, whether the model is "confident enough" to commit to its top choice, or whether it should restrict its choice to a narrow top-k subset.

Because NEG is carried inside the model weights themselves, there is nothing extra to ship or to install: standard transformers loading with trust_remote_code=True attaches the modules automatically. The model file is the feature.
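
As a rough, self-contained illustration of the gating idea (a sketch only, not the Darwin implementation; the threshold value, the top-k width, and the choice to sample within the top-k subset are assumptions):

import torch
import torch.nn.functional as F

def neg_guided_step(base_logits, predicted_entropy, entropy_threshold=2.0, top_k=5):
    # base_logits: [vocab_size] logits from the LM head.
    # predicted_entropy: scalar tensor from a small head on the last hidden state.
    if predicted_entropy.item() <= entropy_threshold:
        # "Confident enough": commit to the greedy top choice.
        return base_logits.argmax(dim=-1)
    # Otherwise restrict the choice to a narrow top-k subset and pick within it.
    topk_vals, topk_idx = base_logits.topk(top_k)
    probs = F.softmax(topk_vals, dim=-1)
    return topk_idx[torch.multinomial(probs, num_samples=1)].squeeze()

print(neg_guided_step(torch.randn(50), torch.tensor(3.1)))  # toy 50-token vocabulary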

Why it matters

  • 1× inference cost: no multi-sample voting, no multi-turn loops
  • < 5% gate activation: negligible latency overhead versus the base model
  • +12.63 %p on GPQA Diamond vs. the NEG-free Darwin-9B-Opus baseline (same greedy decoding, same prompt, same tokens)
  • Single-file deployment: drop in to vLLM / SGLang / TGI / transformers, no new engine required
  • No trade-secret leaks: the merge recipe is kept internal; only the final model weights are released under Apache 2.0

๐Ÿ—๏ธ Architecture Overview

Input Text
    ↓
[Darwin-9B-Opus backbone (frozen during NEG training)]
    ↓
Transformer Layers × 32
    ↓
last hidden state ──┐
    │               │
    ▼               ▼
 LM Head         NEG-Head
    │               │
  base logits    predicted entropy
    │               │
    └──▶ NEG-Gate ◀─┘
            │
            ▼
       guided logits
            │
            ▼
        next token

Key Specifications

| Component | Value |
|:---|:---|
| Architecture | Qwen3.5 decoder-only transformer (32 layers, hidden 4096) |
| Total parameters | 8.95 B (base) + ≈4 M (NEG modules) |
| NEG-Head | 2-layer MLP with softplus output |
| NEG-Gate | top-k masking gate with learnable entropy threshold |
| Precision | bfloat16 |
| Context length | inherited from Darwin-9B-Opus |
| License | Apache 2.0 |


๐Ÿ† Benchmark Results โ€” GPQA Diamond (198 PhD-level questions)

Darwin-9B-NEG ships three decoding modes from the same model weights, allowing users to trade inference cost for accuracy:

| Mode | Decoding Protocol | Inference Cost | Accuracy |
|:---:|:---|:---:|:---:|
| 0 · Baseline | Darwin-9B-Opus greedy (NEG disabled) | 1× | 51.01 % |
| 1 · Pure NEG | greedy decoding with NEG enabled | 1× | 63.64 % |
| 2 · Permutation | NEG + choice-order permutation (4 orderings, majority) | 4× | 76.26 % |
| 3 · Ensemble Refinement | NEG + permutation + temperature-sampled ensemble | ≈20× | 🥇 84.34 % |

Improvements:

  • Pure NEG (mode 1) vs. baseline: +12.63 %p at identical inference cost
  • Ensemble (mode 3) vs. baseline: +33.33 %p
  • Ensemble vs. Qwen3.5-9B leaderboard score (81.7 %): +2.64 %p

Gate activation rate: 4.36% (measured across the 198-question greedy run). NEG fires conservatively, only when the model is genuinely uncertain.


🚀 Usage

Quick start: Pure NEG greedy (mode 1, sales default)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-9B-NEG",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-9B-NEG",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve: If f(x) = xยณ โˆ’ 3x + 2, find and classify all critical points."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))

Using the bundled NEG loader helper

modeling_darwin_neg.py is shipped inside the repo and provides a convenience loader:

from modeling_darwin_neg import load_darwin_neg

model = load_darwin_neg(
    "FINAL-Bench/Darwin-9B-NEG",
    hf_token="hf_xxx",
)

Mode selection

  • Mode 1 (Pure NEG): default do_sample=False, NEG is always on.
  • Mode 2 (Permutation): shuffle the option order 4 times, greedy each, majority-vote.
  • Mode 3 (Ensemble): production protocol combining permutation, temperature sampling and second-opinion re-query (internal; reproduction scripts are released separately).
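
The mode-2 recipe is simple enough to sketch in a few lines; this outline is illustrative only (the prompting and answer-extraction details are assumptions, not the released scripts):

import random
from collections import Counter

def permutation_vote(question, options, answer_fn, n_orders=4):
    # Ask the same multiple-choice question under several option orderings and
    # majority-vote the results; answer_fn(question, options) should run one
    # greedy, NEG-enabled generation and return the chosen option text.
    votes = []
    for _ in range(n_orders):
        shuffled = options[:]
        random.shuffle(shuffled)
        votes.append(answer_fn(question, shuffled))
    return Counter(votes).most_common(1)[0][0]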

🧬 Model Lineage

Qwen/Qwen3.5-9B   +   (Opus-distilled sibling)
         ╲                ╱
          Darwin V7 evolutionary merge
                   ▼
          Darwin-9B-Opus  ── stand-alone reasoning model (Apache 2.0)
                   ▼
          NEG-Head / NEG-Gate training (Darwin V8)
                   ▼
          Darwin-9B-NEG  ── THIS MODEL
  • Base: FINAL-Bench/Darwin-9B-Opus (weights frozen during NEG training)
  • Technology generation: Darwin V8 (Native Entropy Gating), the successor to Darwin V7 (evolutionary merging)

🎯 Recommended Use-Cases

  • Graduate-level STEM reasoning: physics, chemistry, biology, mathematics (GPQA-style)
  • Mathematical problem solving (MATH, AIME-style)
  • Code reasoning and debugging (HumanEval-style)
  • Complex chain-of-thought tasks where a small reasoning model with a big boost is desired

โš ๏ธ Limitations

  • Optimised for English first, with secondary support for Korean / Chinese / Japanese.
  • At 8.95 B parameters, knowledge coverage is smaller than the larger Darwin models (27B / 31B / 36B); for pure world-knowledge tasks consider Darwin-36B-Opus.
  • The Ensemble mode (84.34%) uses ≈20× inference; choose Pure NEG (mode 1) for cost-sensitive deployments.

📚 Citation

@misc{darwin9b_neg_2026,
  title  = {Darwin-9B-NEG: Native Entropy Gating for Self-Regulated Reasoning at 1x Inference Cost},
  author = {FINAL-Bench / Darwin Research Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-9B-NEG}},
  note   = {Darwin V8: Native Entropy Gating technology generation}
}

🔗 Related Darwin Models

  • Darwin-36B-Opus: MoE 36B, Qwen3.6-35B-A3B × Opus distilled, GPQA 88.4 %
  • Darwin-31B-Opus: 31B multilingual-strong reasoning
  • Darwin-27B-Opus: 27B dense, GPQA 86.9 %
  • Darwin-28B-Opus: Qwen3.6-27B × rico03 Opus distilled (new 2026-04)
  • Darwin-9B-Opus: this model's base, Qwen3.5-9B family
  • Darwin-4B-Genesis: smallest member, Gemma4 family

Darwin V8 · Sealed 2026-04-24 · FINAL-Bench

Author: FINAL-Bench

Likes: 20

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, darwin, darwin-v8, darwin-neg, native-entropy-gating, NEG, reasoning, self-regulated-reasoning, advanced-reasoning, thinking, qwen3.5, qwen, gpqa, benchmark, open-source, apache-2.0, hybrid-vigor, proto-agi, vidraft, eval-results, text-generation, conversational, en, zh, ko, ja, multilingual, base_model:FINAL-Bench/Darwin-9B-Opus, base_model:finetune:FINAL-Bench/Darwin-9B-Opus, license:apache-2.0, model-index, endpoints_compatible, region:us

FINAL-Bench/Darwin-28B-Opus


license: apache-2.0
language:
  - en
  - zh
  - ko
  - ja
  - multilingual
library_name: transformers
pipeline_tag: text-generation
tags:
  - darwin
  - darwin-v7
  - evolutionary-merge
  - merge
  - mergekit
  - reasoning
  - advanced-reasoning
  - chain-of-thought
  - thinking
  - qwen3.6
  - qwen
  - claude-opus
  - distillation
  - multilingual
  - gpqa
  - benchmark
  - open-source
  - apache-2.0
  - hybrid-vigor
  - proto-agi
  - vidraft
  - eval-results
base_model:
  - Qwen/Qwen3.6-27B
  - rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled
base_model_relation: merge
model-index:
  - name: Darwin-28B-Opus
    results:
      - task:
          type: text-generation
          name: Graduate-Level Reasoning
        dataset:
          type: Idavidrein/gpqa
          name: GPQA Diamond
          config: gpqa_diamond
          split: train
        metrics:
          - type: accuracy
            value: 88.89
            name: Accuracy
            verified: false

Darwin-28B-Opus: Qwen3.6-27B × Opus-Distilled Evolutionary Merge

<p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-28B-Opus"><img src="https://img.shields.io/badge/โญ_GPQA_Diamond-88.89%25_Darwin--28B--Opus-gold?style=for-the-badge" alt="GPQA"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Sibling-Darwin--36B--Opus_(88.4%25)-blue?style=for-the-badge" alt="36B"></a> </p> <p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--4B--Genesis-blue?style=for-the-badge" alt="Genesis"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-NEG"><img src="https://img.shields.io/badge/โšก_Model-Darwin--9B--NEG_(84.3%25)-purple?style=for-the-badge" alt="NEG"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--27B--Opus_(86.9%25)-blue?style=for-the-badge" alt="27B"></a> </p> <p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/๐Ÿงฌ_Model-Darwin--31B--Opus_(85.9%25)-blue?style=for-the-badge" alt="31B"></a> <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/โญ_Model-Darwin--36B--Opus_(88.4%25)-blue?style=for-the-badge" alt="36B"></a> </p> <p align="center"> <a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/๐Ÿ _Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a> <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/๐Ÿ†_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a> </p>

Qwen3.6-27B dense · 27.6B parameters · Hybrid Linear/Full Attention · BF16 · Thinking Mode · Apache 2.0
Darwin V7 evolutionary merge: Father × Opus-distilled Mother → 88.89% on GPQA Diamond (3-stage adaptive evaluation)


Abstract

Darwin-28B-Opus is the first reasoning model of the Darwin series built on the Qwen3.6 generation backbone. Produced by the Darwin V7 evolutionary breeding engine from two publicly available parents, it combines the strong bilingual reasoning of Qwen3.6-27B with Claude Opus 4-style chain-of-thought distilled behaviour.

On the GPQA Diamond graduate-level reasoning benchmark (198 PhD-level questions), Darwin-28B-Opus scores 88.89 % under the standard 3-stage adaptive evaluation, slightly edging out its larger MoE sibling Darwin-36B-Opus (88.4 %) and clearly surpassing its Qwen3.5-generation counterpart Darwin-27B-Opus (86.9 %).


🧬 Model Lineage

| Role | Model | Role in the Merge |
|:---:|:---|:---|
| Father (父) | Qwen/Qwen3.6-27B | Qwen3.6 generation dense backbone with hybrid linear/full attention. |
| Mother (母) | rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled | Claude Opus reasoning-distilled variant of the same backbone (Jackrong-style distillation, 14 k traces). |
| Offspring | Darwin-28B-Opus (this model) | Darwin V7 evolutionary merge; Qwen3.6 architecture retained, Opus reasoning style inherited. |

Why 28B? The 28B label denotes the Qwen3.6-generation member of the Darwin lineup (+1 over the Qwen3.5-era Darwin-27B-Opus). The actual parameter count is 27.6 B, and the architecture exactly follows Qwen3.6-27B.


โš™๏ธ Technical Specifications

| Component | Value |
|:---|:---|
| Architecture | Qwen3_5ForConditionalGeneration (Qwen3.6 generation, hybrid linear + full attention) |
| Parameters | 27.6 B (BF16) |
| Hidden size | 5120 |
| Intermediate size | 17408 |
| Head dim | 256 |
| Layers | 64 (3 linear : 1 full attention, full_attention_interval = 4) |
| Precision | bfloat16 |
| Context length | Inherited from base (long-chain reasoning supported) |
| License | Apache 2.0 |


๐Ÿ† Benchmark โ€” GPQA Diamond (198 questions)

Darwin-28B-Opus is evaluated under our standard 3-stage adaptive evaluation protocol, identical to the protocol used across the Darwin series.

| Stage | Decoding Protocol | Cost | Accuracy |
|:---:|:---|:---:|:---:|
| Stage 1 | Single-shot greedy baseline | 1× | 74.75 % (148 / 198) |
| Stage 2 | Majority vote ×8 at temperature 0.7 on Stage-1 wrongs | 8× | 83.84 % (166 / 198) |
| Stage 3 | Adaptive ensemble refinement (close-tie tiebreaker + iterative MTI on residual hard questions) | ≈ 20× | 🥇 88.89 % (176 / 198) |

Key performance indicators:

  • Stage 1 → Stage 3: +14.14 %p through adaptive protocol
  • vs Darwin-27B-Opus (86.9 %): +1.99 %p
  • vs Darwin-36B-Opus (88.4 %): +0.49 %p
  • vs Darwin-31B-Opus (85.9 %): +2.99 %p

🚀 Usage

Standard inference (Stage 1 baseline)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-28B-Opus",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-28B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user",
     "content": "Solve: If f(x) = xยณ โˆ’ 3x + 2, find all critical points and classify them."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
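
This model is tagged as a thinking model, so completions may begin with a reasoning trace. A minimal post-processing sketch, assuming the Qwen-family convention of wrapping that trace in <think>…</think> (check the chat template before relying on this; the helper below is illustrative and not part of the official card):

# Illustrative helper, not part of the official card: split an assumed
# <think>…</think> reasoning trace from the final answer. If the tags are
# registered as special tokens, decode with skip_special_tokens=False so
# they survive into the string.
import re

def split_thinking(decoded: str):
    """Return (reasoning_trace, final_answer) from a decoded completion."""
    match = re.search(r"<think>(.*?)</think>", decoded, flags=re.DOTALL)
    if match is None:
        return "", decoded.strip()
    return match.group(1).strip(), decoded[match.end():].strip()

raw = tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False)
reasoning, answer = split_thinking(raw)
print(answer)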

Enhanced accuracy (Stage 2-3 adaptive)

For leaderboard-grade accuracy, combine:

  1. Stage 1 greedy baseline,
  2. Stage 2 maj@8 temperature sampling on low-confidence answers,
  3. Stage 3 adaptive refinement on still-disputed answers.

Reference implementation is provided in the Darwin-series evaluation harness.
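
The harness itself is not reproduced in this digest, but the Stage-2 idea is straightforward to sketch. Below is a minimal illustration, reusing tok and model from the snippet above and assuming answers reduce to a single A-D choice letter; the extraction regex and k = 8 are illustrative, not the harness's actual logic:

# Minimal Stage-2 sketch (illustrative, not the Darwin evaluation harness):
# sample k completions at temperature 0.7 and majority-vote the extracted
# answer letter.
import re
from collections import Counter

def majority_vote(question: str, k: int = 8, temperature: float = 0.7) -> str:
    messages = [{"role": "user", "content": question}]
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    votes = []
    for _ in range(k):
        out = model.generate(**inputs, max_new_tokens=2048,
                             do_sample=True, temperature=temperature)
        completion = tok.decode(out[0][inputs.input_ids.shape[-1]:],
                                skip_special_tokens=True)
        letters = re.findall(r"\b([A-D])\b", completion)
        if letters:
            votes.append(letters[-1])   # take the last stated choice
    return Counter(votes).most_common(1)[0][0] if votes else ""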


🎯 Recommended Use-Cases

  • Graduate-level STEM reasoning (GPQA / science qualifying exams)
  • Mathematical problem solving (MATH, AIME-style problems)
  • Code generation and debugging (HumanEval, MBPP)
  • Complex multi-step chain-of-thought tasks
  • Bilingual reasoning (strong English + Korean; also Chinese / Japanese)

โš ๏ธ Limitations

  • At 27.6 B parameters in bfloat16, full inference requires ≈ 55 GB of VRAM (e.g., a single A100-80GB or B200).
  • Optimised for English first, with secondary support for Korean, Chinese, and Japanese.
  • Deep Opus-style reasoning traces tend to be verbose; control with max_new_tokens as needed.

📚 Citation

@misc{darwin28b_opus_2026,
  title  = {Darwin-28B-Opus: Evolutionary Merging of Qwen3.6-27B with Claude-Opus-Distilled Reasoning},
  author = {FINAL-Bench / Darwin Research Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-28B-Opus}},
  note   = {Darwin V7 · Mother-centric Ratio Interpolation merge · 88.89 % GPQA Diamond (3-stage)}
}

🔗 Related Darwin Models

  • Darwin-36B-Opus - MoE 36B, Qwen3.6-35B-A3B × Opus distilled, GPQA 88.4 %
  • Darwin-31B-Opus - 31B dense, multilingual-strong reasoning, GPQA 85.9 %
  • Darwin-27B-Opus - 27B dense (Qwen3.5 generation), GPQA 86.9 %
  • Darwin-9B-NEG - 9B with Native Entropy Gating, GPQA 84.3 %
  • Darwin-9B-Opus - the Qwen3.5-9B Darwin member
  • Darwin-4B-Genesis - smallest Darwin member

Darwin V7 · Qwen3.6 generation flagship · Sealed 2026-04-25 · FINAL-Bench

Author: FINAL-Bench

Likes: 10

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, darwin, darwin-v7, evolutionary-merge, merge, mergekit, reasoning, advanced-reasoning, chain-of-thought, thinking, qwen3.6, qwen, claude-opus, distillation, multilingual, gpqa, benchmark, open-source, apache-2.0, hybrid-vigor, proto-agi, vidraft, eval-results, text-generation, conversational, en, zh, ko, ja, base_model:Qwen/Qwen3.6-27B, base_model:merge:Qwen/Qwen3.6-27B, base_model:rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled, base_model:merge:rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled, license:apache-2.0, model-index, endpoints_compatible, region:us

spiritbuun/Qwen3.6-27B-DFlash-GGUF


license: mit library_name: gguf base_model: z-lab/Qwen3.6-27B-DFlash tags:

  • gguf
  • speculative-decoding
  • dflash
  • drafter
  • llama.cpp
  • quantized

Qwen3.6-27B-DFlash - GGUF (Q4_K_M + Q8_0)

llama.cpp quantizations of z-lab/Qwen3.6-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with Qwen/Qwen3.6-27B (or a quant of it).

Two quants are published:

| File | Size | Recommended? |
|---|---|---|
| dflash-draft-3.6-q8_0.gguf | 1.75 GB | Yes - use this. Matches F16 acceptance. |
| dflash-draft-3.6-q4_k_m.gguf | 1.03 GB | Only if VRAM-constrained; acceptance drops ~17 points. |

Unlike the 3.5 drafter (all full-attention, Q4-robust), the 3.6 drafter introduces causal sliding-window attention layers (pattern [S,S,S,S,F], window = 2048). Those SWA layers are Q4-fragile: Q4_K_M collapses acceptance from ~43 % → ~28 % on the same workload. Q8_0 is the smallest quant that preserves F16 quality and happens to run slightly faster than F16 in our benchmarks.

Requirements

DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork:

  • Fork: spiritbuun/buun-llama-cpp (branch master)
  • SWA support for the DFlash drafter landed in commit b9d01582b (SD-073). Older checkouts (before that commit) will load the drafter but produce garbage.
  • Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

Usage

llama-server

./build/bin/llama-server \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -np 1 -c 6048 -cd 256 \
    -fa on -b 256 -ub 64 \
    --host 0.0.0.0 --port 8080 --jinja \
    --chat-template-kwargs '{"enable_thinking": false}'

Thinking footgun: the Qwen3.6 chat template enables <think>…</think> by default. That collapses DFlash acceptance because the drafter wasn't trained on the think-wrapped distribution. Pass --chat-template-kwargs '{"enable_thinking": false}' to disable it (≈1.8× throughput uplift).
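
Once the server above is running, any OpenAI-compatible client can talk to it. A quick client-side check (a sketch assuming the host/port from the command above; llama-server exposes an OpenAI-compatible chat endpoint):

# Quick check against the llama-server instance started above. Host/port are
# the ones passed on the command line; the endpoint is llama-server's
# OpenAI-compatible chat API.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a Python mergesort."}],
        "temperature": 0.0,
        "max_tokens": 400,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])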

llama-speculative-simple

./build/bin/llama-speculative-simple \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -c 4096 --draft-max 16 --draft-min 1 \
    -p "Write a Python mergesort."

Observed performance (RTX 3090, llama-server, Qwen3.6-27B UD-Q4_K_XL target, Python BST code prompt, temp = 0, 400 tokens, thinking OFF)

| Drafter quant | Raw (t/s) | Raw accept | Chat (t/s) | Chat accept |
|---|---:|---:|---:|---:|
| Q8_0 (recommended) | 87 | 37 % | 97 | 43 % |
| F16 | 80 | 36 % | 93 | 45 % |
| Q4_K_M | 73 | 29 % | 70 | 28 % |

Q8_0 tracks F16 within noise and is half the size.

Note on comparison with the 3.5 drafter

Short-context code prompts do not exercise the sliding-window attention (most queries fall inside the 2048-token window anyway), so the 3.6 drafter's architectural change doesn't produce a dramatic win on this benchmark. The SWA infrastructure is expected to matter on longer-context workloads (> 2 k generated tokens). On short code, Q8_0 on 3.6 is ≈1.3× the throughput of Q4_K_M on 3.5 because the 3.6 target pairs slightly better with the retrained drafter.

Quantization details

  • Source: z-lab/Qwen3.6-27B-DFlash (BF16 safetensors, 2 B parameters)
  • Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp; emits qwen35.attention.sliding_window + qwen35.attention.sliding_window_pattern so the runtime builds per-layer SWA masks
  • Quants: llama-quantize โ†’ Q4_K_M, Q8_0
  • Tensors: drafter transformer (5 layers, pattern [S,S,S,S,F], window = 2048) + projection heads + cross-attention layers targeting Qwen3.6-27B layer ids [1, 16, 31, 46, 61]

Reproducing the conversion

Tokenizer heads-up: the upstream z-lab/Qwen3.6-27B-DFlash repo ships only config.json, model.safetensors, and a README (no tokenizer files). The drafter shares the target model's tokenizer. Copy the Qwen3.6 tokenizer files into the drafter directory first.

# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./dflash-drafter-3.6

# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.6-27B \
    tokenizer.json tokenizer_config.json vocab.json merges.txt \
    special_tokens_map.json \
    --local-dir ./dflash-drafter-3.6

# 3. Convert to GGUF (F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter-3.6 \
    --outtype f16 \
    --outfile dflash-draft-3.6-f16.gguf

# 4. Quantize
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q8_0.gguf Q8_0
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q4_k_m.gguf Q4_K_M

Required files in ./dflash-drafter-3.6/ before step 3:

| File | Source |
|---|---|
| config.json | z-lab/Qwen3.6-27B-DFlash (has architectures: ["DFlashDraftModel"], use_sliding_window: true, layer_types: [...]) |
| model.safetensors | z-lab/Qwen3.6-27B-DFlash |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen/Qwen3.6-27B |

The converter auto-detects DFlashDraftModel from config.json and emits the SWA metadata when use_sliding_window is set.
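
Before running step 3, a quick check can confirm the directory is complete. This is a convenience sketch, not part of the upstream tooling:

# Convenience check (not part of the upstream tooling): verify the drafter
# directory contains the copied tokenizer files plus the config flags the
# converter keys off.
import json
from pathlib import Path

drafter_dir = Path("./dflash-drafter-3.6")
required = [
    "config.json", "model.safetensors", "tokenizer.json",
    "tokenizer_config.json", "vocab.json", "merges.txt",
    "special_tokens_map.json",
]
missing = [name for name in required if not (drafter_dir / name).exists()]
if missing:
    raise SystemExit(f"missing files: {missing}")

cfg = json.loads((drafter_dir / "config.json").read_text())
assert cfg.get("architectures") == ["DFlashDraftModel"], cfg.get("architectures")
assert cfg.get("use_sliding_window") is True, "use_sliding_window not set"
print("drafter directory looks ready for convert_hf_to_gguf.py")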


Original model card - z-lab/Qwen3.6-27B-DFlash

Reproduced from the upstream model page. License: MIT.

Overview

Qwen3.6-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.6-27B.

  • Paper: https://arxiv.org/abs/2602.06036
  • GitHub: https://github.com/z-lab/dflash
  • Blog: https://z-lab.ai/projects/dflash/
  • Model Size: 2B parameters (BF16)

What is DFlash?

DFlash is a novel speculative decoding method using a lightweight block diffusion model for drafting, enabling efficient, high-quality parallel drafting that significantly speeds up inference.
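
The block-diffusion drafter itself is not reproduced here, but the verify/accept loop that any speculative decoder builds on is easy to sketch: the target model scores a whole drafted block in one forward pass and keeps the longest prefix on which it agrees with the drafter. The callables below are placeholders, not the DFlash interfaces:

# Generic greedy speculative-decoding step (illustration only; draft_fn and
# verify_fn are placeholder callables, not the DFlash interfaces).
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_fn: Callable[[List[int], int], List[int]],
                     verify_fn: Callable[[List[int], List[int]], List[int]],
                     block: int = 16) -> List[int]:
    """Draft `block` tokens cheaply, verify them with one target pass, and
    keep the longest agreeing prefix plus the target's first correction."""
    draft = draft_fn(prefix, block)     # cheap drafter proposes a block
    preds = verify_fn(prefix, draft)    # target's greedy token at each drafted position
    out = list(prefix)
    for proposed, predicted in zip(draft, preds):
        if proposed == predicted:
            out.append(proposed)        # accepted: drafter and target agree
        else:
            out.append(predicted)       # rejected: keep the target's token and stop
            break
    return out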

Upstream Quick Start (vLLM / SGLang)

vLLM

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

vllm serve Qwen/Qwen3.6-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.6-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

Citation

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

License

MIT - inherited from the upstream model. This repository redistributes quantized derivatives under the same terms.

Author: spiritbuun

Likes: 9

Downloads: 0

Tags: gguf, speculative-decoding, dflash, drafter, llama.cpp, quantized, arxiv:2602.06036, base_model:z-lab/Qwen3.6-27B-DFlash, base_model:quantized:z-lab/Qwen3.6-27B-DFlash, license:mit, endpoints_compatible, region:us, conversational

mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit


license: mit library_name: mlx pipeline_tag: text-generation base_model: deepseek-ai/DeepSeek-V4-Flash tags:

  • mlx

mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit

This model mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit was converted to MLX format from deepseek-ai/DeepSeek-V4-Flash using mlx-lm version 0.31.3.

Use with mlx

pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: mlx-community

Likes: 7

Downloads: 0

Tags: mlx, safetensors, deepseek_v4, text-generation, base_model:deepseek-ai/DeepSeek-V4-Flash, base_model:quantized:deepseek-ai/DeepSeek-V4-Flash, license:mit, 8-bit, region:us

Comfy-Org/void-model


license: apache-2.0 tags:

  • comfyui

VOID: Video Object and Interaction Deletion

Repackaged model files for ComfyUI.

Original model repository:

  • https://huggingface.co/netflix/void-model

Place the files in the following folders:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 diffusion_models/
│   │   ├── void_pass1.safetensors
│   │   └── void_pass2.safetensors
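
If you prefer to script the download, a minimal sketch using huggingface_hub (filenames are taken from the tree above; the ComfyUI path, and the assumption that the files sit at the repo root, may need adjusting for your install):

# Sketch: fetch the repackaged weights straight into a ComfyUI install.
# Assumes the files sit at the root of Comfy-Org/void-model and that ComfyUI
# lives at ./ComfyUI; adjust both as needed.
from pathlib import Path
from huggingface_hub import hf_hub_download

models_dir = Path("ComfyUI/models/diffusion_models")
models_dir.mkdir(parents=True, exist_ok=True)

for filename in ["void_pass1.safetensors", "void_pass2.safetensors"]:
    hf_hub_download(
        repo_id="Comfy-Org/void-model",
        filename=filename,
        local_dir=models_dir,
    )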

Author: Comfy-Org

Likes: 6

Downloads: 0

Tags: comfyui, license:apache-2.0, region:us

ubergarm/Qwen3.6-27B-GGUF


quantized_by: ubergarm pipeline_tag: text-generation base_model: Qwen/Qwen3.6-27B base_model_relation: quantized license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE tags:

  • imatrix
  • conversational
  • qwen3_5
  • ik_llama.cpp

ik_llama.cpp imatrix Quantizations of Qwen/Qwen3.6-27B

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants. Only a couple of quants in this collection are compatible with mainline llama.cpp/LMStudio/KoboldCPP/etc. (as noted in the specific descriptions); all others require ik_llama.cpp.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds. Also check for ik_llama.cpp Windows builds by Thireus.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quantizing and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to huggingface for hosting all these big quants!

Finally, I really appreciate the support from aifoundry.org so check out their open source RISC-V based solutions!

Quant Collection

Perplexity computed against wiki.test.raw. (lower is "better")

(Perplexity and KLD comparison charts are not reproduced in this digest.)

These two are just test quants for baseline perplexity comparison and not available for download here:

  • BF16 50.103 GiB (16.002 BPW)
    • PPL over 580 chunks for n_ctx=512 = 6.9066 +/- 0.04552
  • Q8_0 26.622 GiB (8.502 BPW)
    • PPL over 580 chunks for n_ctx=512 = 6.9063 +/- 0.04551

NOTE: If the models are split, the first file is much smaller and only contains metadata; that is on purpose, it's fine!

IQ5_KS 18.532 GiB (5.919 BPW)

PPL over 580 chunks for n_ctx=512 = 6.9341 +/- 0.04578

This ik_llama.cpp exclusive quant is likely among the best quality available for 24GB full offload.
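
The GiB and BPW figures on this page track each other almost exactly, since file size scales linearly with bits per weight; a quick check against the BF16 baseline:

# File size scales with bits per weight: the IQ5_KS / BF16 size ratio matches
# the BPW ratio (values taken from the figures listed on this page).
bf16_gib, bf16_bpw = 50.103, 16.002
iq5_gib, iq5_bpw = 18.532, 5.919

print(f"{iq5_gib / bf16_gib:.3f}")   # 0.370
print(f"{iq5_bpw / bf16_bpw:.3f}")   # 0.370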

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 64 Repeating Layers [0-63]

## Gated Attention/Delta Net [Blended 0-63]
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Dense Layers [0-63]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Non-Repeating Layers
token_embd\.weight=q6_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

    #--dry-run \
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/imatrix-Qwen3.6-27B-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-BF16-00001-of-00002.gguf \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-IQ5_KS.gguf \
    IQ5_KS \
    128
</details>

smol-IQ4_NL 15.405 GiB (4.920 BPW)

PPL over 580 chunks for n_ctx=512 = 7.0040 +/- 0.04646

This mainline-compatible custom mix uses quantization types hopefully optimized for Vulkan/ROCm (and possibly Mac).

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 64 Repeating Layers [0-63]

## Gated Attention/Delta Net [Blended 0-63]
blk\..*\.attn_gate\.weight=iq4_nl
blk\..*\.attn_qkv\.weight=iq4_nl
blk\..*\.attn_output\.weight=iq4_nl
blk\..*\.attn_q\.weight=iq4_nl
blk\..*\.attn_k\.weight=iq4_nl
blk\..*\.attn_v\.weight=iq4_nl
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Dense Layers [0-63]
blk\..*\.ffn_down\.weight=iq4_nl
blk\..*\.ffn_(gate|up)\.weight=iq4_nl

# Non-Repeating Layers
token_embd\.weight=iq4_nl
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

    #--dry-run \
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/imatrix-Qwen3.6-27B-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-BF16-00001-of-00002.gguf \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-smol-IQ4_NL.gguf \
    IQ4_NL \
    128
</details>

IQ4_KS 14.693 GiB (4.693 BPW)

PPL over 580 chunks for n_ctx=512 = 6.9740 +/- 0.04599

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 64 Repeating Layers [0-63]

## Gated Attention/Delta Net [Blended 0-63]
blk\..*\.attn_gate\.weight=iq4_ks
blk\..*\.attn_qkv\.weight=iq4_ks
blk\..*\.attn_output\.weight=iq4_ks
blk\..*\.attn_q\.weight=iq4_ks
blk\..*\.attn_k\.weight=iq4_ks
blk\..*\.attn_v\.weight=iq4_ks
blk\..*\.ssm_alpha\.weight=q6_0
blk\..*\.ssm_beta\.weight=q6_0
blk\..*\.ssm_out\.weight=q6_0

# Dense Layers [0-63]
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Non-Repeating Layers
token_embd\.weight=q6_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

    #--dry-run \
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/imatrix-Qwen3.6-27B-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-BF16-00001-of-00002.gguf \
    /mnt/data/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-IQ4_KS-noimat.gguf \
    IQ4_KS \
    128
</details>


Author: ubergarm

Likes: 5

Downloads: 0

Tags: gguf, imatrix, conversational, qwen3_5, ik_llama.cpp, text-generation, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us

Intel/Qwen3.6-27B-4.5b-mlx-AutoRound


base_model:

  • Qwen/Qwen3.6-27B

Model Details

This model is an MLX-format, mixed 4.5-bit quantization of Qwen/Qwen3.6-27B (group_size 128, symmetric), generated by intel/auto-round. Please follow the license of the original model.

We currently support this format in AutoRound, but do not have the hardware to validate this large model.

As a result, we are unable to verify whether it runs correctly or achieves expected performance.

We would greatly appreciate your help in testing it, and welcome any contributions to our open-source project.

MLX-VLM inference

from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_name_or_path= "Intel/Qwen3.5-4B-int4-mlx-AutoRound"

model, processor = load(model_name_or_path)
mlx_cfg = load_config(model_name_or_path)
prompt_text = "Describe this image in one sentence."
formatted = apply_chat_template(processor, mlx_cfg, prompt_text, num_images=1)
# Use a public example image so the test does not need local assets.
image_url = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"

output = generate(model, processor, formatted, image=[image_url], max_tokens=2048).text
print(output)

Generate the Model

This PR is required: https://github.com/intel/auto-round/pull/1732

  AR_DISABLE_COPY_MTP_WEIGHTS=1 CUDA_VISIBLE_DEVICES=$device python3 -m auto_round \
  --target_bits 4.5 \
  --options "W4A16,W6A16,W8A16" \
  --model_name  $model_name \
  --ignore_scale_zp_bits \
  --format mlx \
  --output_dir "./test_mlx_mixed" \
  2>&1 | tee -a test_mlx.txt
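
To make the "mixed 4.5-bit" target concrete: AutoRound assigns different bit widths per layer (here W4A16, W6A16, W8A16) so that the weight-fraction-weighted average lands near --target_bits. The split below is hand-picked for the arithmetic only; the real assignment comes out of AutoRound's search:

# Toy illustration of how a mixed W4/W6/W8 assignment averages to ~4.5 bits
# per weight. The fractions are hypothetical, not the actual assignment.
assignment = {4: 0.75, 6: 0.25, 8: 0.0}   # fraction of weights at each bit width

avg_bits = sum(bits * frac for bits, frac in assignment.items())
print(avg_bits)   # 4.5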

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.


Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title   = {Optimize weight rounding via signed gradient descent for the quantization of llms},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}


Author: Intel

Likes: 5

Downloads: 0

Tags: safetensors, qwen3_5, arxiv:2309.05516, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, 4-bit, region:us

sgl-project/DeepSeek-V4-Pro-FP8


license: mit base_model: deepseek-ai/DeepSeek-V4-Pro tags:

  • deepseek-v4
  • fp8
  • quantized

DeepSeek-V4-Pro-FP8

FP8 re-packaging of deepseek-ai/DeepSeek-V4-Pro. Model architecture, tokenizer, chat template, and reference encoding/ are unchanged from the base repo. No fine-tuning, no retraining; weights only.

Deployment

SGLang Cookbook: https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4

License

MIT - see LICENSE. Copyright © DeepSeek.

Author: sgl-project

Likes: 5

Downloads: 0

Tags: safetensors, deepseek_v4, deepseek-v4, fp8, quantized, base_model:deepseek-ai/DeepSeek-V4-Pro, base_model:quantized:deepseek-ai/DeepSeek-V4-Pro, license:mit, region:us