Today's AI Summary

AI Developments: DeepSeek's Sparse Attention, Ring-1T Reasoning, and More

Today's AI landscape is buzzing with new models and research pushing the boundaries of language understanding, reasoning, and efficiency. Here's a quick rundown:

Research Highlights

  • VoiceAssistant-Eval: A new benchmark, VoiceAssistant-Eval, has been introduced to evaluate AI assistants across listening, speaking, and viewing capabilities. The benchmark reveals that proprietary models don't always outperform open-source ones, and that most models struggle with audio understanding and multimodal input.
  • See, Point, Fly: The paper See, Point, Fly introduces a training-free aerial vision-and-language navigation (AVLN) framework that leverages VLMs to decompose language instructions into 2D waypoints for UAV navigation. The framework achieves state-of-the-art performance in DRL simulation benchmarks and demonstrates strong generalization in real-world evaluations.
  • CapRL: The paper CapRL introduces a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image.
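CapRL's utility criterion can be sketched as a reward function: score a caption by how often a stand-in "blind" model answers image questions correctly from the caption alone. The QA pairs and the `toy_lm` stub below are illustrative assumptions, not CapRL's actual components:

```python
def caption_utility_reward(caption, qa_pairs, answer_fn):
    """CapRL-style utility: the fraction of questions about an image that a
    non-visual model answers correctly from the caption alone.
    answer_fn(caption, question) stands in for that language model."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_fn(caption, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)

def toy_lm(caption, question):
    # Trivial stand-in for an LLM: answer with the caption's last word.
    return caption.split()[-1]

qa = [("What animal is shown?", "cat"), ("What color is it?", "black")]
reward = caption_utility_reward("a photo of a black cat", qa, toy_lm)
print(reward)  # 0.5
```

In the real framework the reward would come from a language model's QA accuracy, but the scoring loop has this shape.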

Model Spotlight

  • DeepSeek-V3.2-Exp (305 Likes): DeepSeek AI has released an experimental version of their model, DeepSeek-V3.2-Exp. This model introduces "DeepSeek Sparse Attention," a mechanism designed to improve training and inference efficiency in long-context scenarios. Benchmarks show performance on par with V3.1-Terminus, while achieving substantial improvements in long-context training and inference efficiency.
  • Ring-1T-preview (38 Likes): InclusionAI has released a preview version of Ring-1T, a trillion-parameter language model focused on natural language reasoning. The model achieves a high score of 92.6 on AIME 2025 through pure natural language reasoning.
  • aquif-3.6-8B (4 Likes): Aquif-ai has released aquif-3.6-8B, a hybrid reasoning model that automatically determines when and how deeply to think based on query complexity. It achieves 28% better token efficiency and 4% performance improvement across benchmarks.
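The general idea behind sparse attention, as in DeepSeek-V3.2-Exp, can be illustrated with a generic top-k sketch in NumPy. This is not DeepSeek's actual mechanism (which also makes the key-selection step itself efficient); it only shows how restricting each query to a small subset of keys shrinks the softmax/value aggregation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """For each query, attend only to its top_k highest-scoring keys.

    Shrinks the softmax/value aggregation from O(n_keys) to O(top_k) per
    query; the scoring step here is still dense for simplicity.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (n_q, n_k)
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    mask = np.full_like(scores, -np.inf)                # hide all keys...
    np.put_along_axis(mask, idx, 0.0, axis=-1)          # ...except the top-k
    weights = softmax(scores + mask, axis=-1)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = topk_sparse_attention(q, k, v, top_k=4)
print(out.shape)  # (4, 8)
```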

Key Takeaways

  • Sparse Attention for Efficiency: DeepSeek's work highlights the importance of sparse attention mechanisms for improving the efficiency of large language models, especially when dealing with long contexts.
  • Reasoning-Focused Models: The release of Ring-1T-preview demonstrates the ongoing efforts to develop models with strong reasoning capabilities, pushing the boundaries of what's possible in natural language understanding.
  • Dynamic Reasoning: Aquif-ai's aquif-3.6-8B showcases the potential of hybrid reasoning models that dynamically decide if and how much to think based on query complexity.

AI Papers for 2026-03-18

Mixture-of-Depths Attention

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
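Under one simplified reading of the abstract, MoDA can be sketched as a head that attends to the current layer's causal sequence KVs plus per-position KVs cached from preceding layers. All shapes and the exact mixing scheme below are assumptions for illustration, not the released implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def moda_attention(q, k, v, depth_k, depth_v):
    """Toy mixture-of-depths attention for one head.

    q, k, v:          (n, d)    current-layer sequence queries/keys/values
    depth_k, depth_v: (L, n, d) keys/values cached from L preceding layers

    Each query at position t attends to the current layer's sequence KV
    pairs (causal) plus the depth KV pairs at position t from every
    preceding layer, letting shallow-layer features reach deep layers
    directly instead of only through residual updates.
    """
    n, d = q.shape
    out = np.zeros_like(q)
    for t in range(n):
        keys = np.concatenate([k[: t + 1], depth_k[:, t]], axis=0)
        vals = np.concatenate([v[: t + 1], depth_v[:, t]], axis=0)
        w = softmax(q[t] @ keys.T / np.sqrt(d))
        out[t] = w @ vals
    return out

rng = np.random.default_rng(0)
n, d, L = 6, 8, 3
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
depth_k, depth_v = rng.normal(size=(L, n, d)), rng.normal(size=(L, n, d))
out = moda_attention(q, k, v, depth_k, depth_v)
print(out.shape)  # (6, 8)
```

The paper's hardware-efficient algorithm addresses the non-contiguous memory access this naive loop would incur.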

Mechanistic Origin of Moral Indifference in Language Models

Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and selectively reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.

Do Metrics for Counterfactual Explanations Align with User Perception?

Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.
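The study's core measurement, correlating an algorithmic metric with human ratings, can be reproduced in miniature. The numbers below are made up for illustration; they only show the mechanics of the comparison:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length samples.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))

# Hypothetical data: one algorithmic metric score and the mean human
# quality rating for five counterfactual explanations.
metric_scores = [0.10, 0.30, 0.35, 0.60, 0.90]
human_ratings = [4.2, 3.9, 4.0, 2.5, 3.1]

r = pearson(metric_scores, human_ratings)
print(round(r, 3))  # -0.776
```

The paper's finding is that for real metrics and real ratings, such correlations are generally weak and dataset-dependent.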

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby enabling the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (in a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

Computational Concept of the Psyche

This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent's needs, taking into account their biological or existential significance for the intelligent agent, along with agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.
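The proposed formalization, optimal action selection over a space of needs under uncertainty, can be sketched as expected-utility maximization over sampled outcomes. The simulator, weights, and utility form below are illustrative assumptions, not the article's implementation:

```python
import random

def choose_action(actions, simulate, weights=(1.0, 1.0, 1.0), samples=100):
    """Pick the action maximizing expected utility over stochastic rollouts.

    simulate(action) returns (goal_progress, existential_risk, energy_cost)
    for one rollout; the weighted combination mirrors the article's
    objective of maximizing success while minimizing risk and energy use.
    """
    w_goal, w_risk, w_energy = weights

    def expected_utility(action):
        total = 0.0
        for _ in range(samples):
            goal, risk, energy = simulate(action)
            total += w_goal * goal - w_risk * risk - w_energy * energy
        return total / samples

    return max(actions, key=expected_utility)

random.seed(0)
# Toy simulator: more effortful actions make more progress but cost more.
sim = lambda a: (a, 0.1 * a, 0.05 * a + random.random() * 0.01)
best = choose_action([0.0, 0.5, 1.0], sim)
print(best)  # 1.0
```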

Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask

Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from contemporary lithography masks are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, based on a waveguide method with its most computationally expensive components replaced by a neural network. To evaluate performance, the accuracy and inference time of PINNs and NOs are compared against modern numerical solvers for a series of problems with known exact solutions. The emphasis is placed on investigating the solution accuracy of the considered neural systems at wavelengths of 13.5 nm and 11.2 nm. Numerical experiments on realistic 2D and 3D masks demonstrate that PINNs and neural operators achieve competitive accuracy and significantly reduced prediction times, with the proposed WGNO architecture reaching state-of-the-art performance. The presented neural operator has pronounced generalizing properties, meaning that for unseen problem parameters it delivers a solution accuracy close to that for parameters seen in the training dataset. These results provide a highly efficient solution for accelerating the design and optimization workflows of next-generation lithography masks.

Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.
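Since trailers are just `Key: value` lines in a commit message's final paragraph, the protocol can be read back with a few lines of Python. The trailer keys below (Constraint, Rejected-Alternative, Agent-Directive) paraphrase the record fields the paper describes and are not its normative key names:

```python
def parse_trailers(commit_message: str) -> dict:
    """Read git-style trailers (Key: value lines in the final paragraph)
    of a commit message. Minimal sketch; the real `git interpret-trailers`
    command handles more edge cases (folded lines, configured separators)."""
    last_paragraph = commit_message.strip().split("\n\n")[-1]
    trailers = {}
    for line in last_paragraph.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            trailers[key.strip()] = value.strip()
    return trailers

msg = """Switch cache eviction to LRU

Plain FIFO caused hot keys to be evicted under load.

Constraint: eviction must be O(1) per access
Rejected-Alternative: LFU (too costly to maintain counts)
Agent-Directive: do not change the public Cache API
"""

print(parse_trailers(msg)["Constraint"])  # eviction must be O(1) per access
```

Because any agent that can run shell commands can extract these records (e.g. via `git log`), no infrastructure beyond git itself is needed.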

The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.
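The anchoring finding suggests a simple prompt shape: state the correct solution before asking for misconception-driven distractors. The wording below is illustrative, not the paper's prompt:

```python
def build_distractor_prompt(question: str, solution: str, n: int = 3) -> str:
    """Anchor the model to the correct solution before asking for
    misconception-driven distractors; the paper finds this anchoring
    improves alignment with human-authored distractors."""
    return (
        f"Question: {question}\n"
        f"Correct solution: {solution}\n\n"
        f"Simulate common student misconceptions and propose {n} incorrect "
        "but plausible answer choices. For each, name the misconception it "
        "reflects."
    )

prompt = build_distractor_prompt("What is 3/4 + 1/2?", "5/4")
print(prompt.splitlines()[1])  # Correct solution: 5/4
```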

AI Models

unsloth/Mistral-Small-4-119B-2603-GGUF


base_model:

  • mistralai/Mistral-Small-4-119B-2603

license: apache-2.0

language:

  • ar
  • en
  • fr
  • es
  • de
  • it
  • pt
  • nl
  • ja
  • ko
  • zh

tags:

  • vLLM
  • unsloth

[!NOTE] Includes Unsloth chat template fixes! For llama.cpp, use --jinja

*[Unsloth Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf) achieves superior accuracy & outperforms other leading quants.*

Mistral Small 4 119B A6B

Mistral Small 4 is a powerful hybrid model capable of acting as both a general instruction model and a reasoning model. It combines the capabilities of three model families—Instruct, Reasoning (previously called Magistral), and Devstral—into a single model.

With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a powerful general-purpose model for any task. In a latency-optimized setup, Mistral Small 4 achieves a 40% reduction in end-to-end completion time, and in a throughput-optimized setup, it handles 3x more requests per second compared to Mistral Small 3.

To further improve efficiency, you can take advantage of:

Key Features

Mistral Small 4 includes the following architectural choices:

  • MoE: 128 experts, 4 active.
  • 119B parameters, with 6.5B activated per token.
  • 256k context length.
  • Multimodal input: Accepts both text and image input, with text output.
  • Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).

Mistral Small 4 offers the following capabilities:

  • Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
  • Vision: Analyzes images and provides insights based on visual content, in addition to text.
  • Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
  • System Prompt: Strong adherence and support for system prompts.
  • Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
  • Speed-Optimized: Delivers best-in-class performance and speed.
  • Apache 2.0 License: Open-source license for both commercial and non-commercial use.
  • Large Context Window: Supports a 256k context window.

Use Cases

Mistral Small 4 is designed for general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode toggled). Its multimodal capabilities also enable document and image understanding for data extraction and analysis.

Its capabilities are ideal for:

  • Developers interested in coding and agentic capabilities for SWE automation and codebase exploration.
  • Enterprises seeking general chat assistants, agents, and document understanding.
  • Researchers leveraging its math and research capabilities.

Mistral Small 4 is also well-suited for customization and fine-tuning for more specialized tasks.

Examples

  • General chat assistant
  • Document parsing and extraction
  • Coding agent
  • Research assistant
  • Customization & fine-tuning
  • And more...

Benchmarks

Comparison with internal models

Depending on your task, you can trigger reasoning via the per-request parameter reasoning_effort. Set it to:

Internal benchmark

Comparing Reasoning Models

Internal benchmark - Reasoning

Comparison with other models

Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B across all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores 0.72 with just 1.6K characters, whereas Qwen models require 3.5-4x more output (5.8-6.1K) for comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing 20% less output. This efficiency reduces latency and inference costs, and improves the user experience.

Comparison benchmarks (figures): AA LCR, LiveCodeBench, AIME25

Usage

Mistral Small 4 is supported by multiple libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this happen.

Inference

The model can be deployed with:

If local serving performance is insufficient, we recommend using the Mistral AI API.

Fine-Tuning

Fine-tune the model via:

vLLM (Recommended)

We recommend using Mistral Small 4 with the vLLM library for production-ready inference.

Installation

[!Tip] Use our custom Docker image with fixes for tool calling and reasoning parsing in vLLM, and the latest Transformers version. We are working with the vLLM team to merge these fixes soon.

Custom Docker: use the following Docker image, mistralllm/vllm-ms4:latest:

docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest

Manual Install: alternatively, install vLLM from this PR: Add Mistral Guidance.

Note: This PR is expected to be merged into vllm main in the next 1-2 weeks (as of 16.03.2026). Track updates here.

  1. Clone vLLM:
    git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
    
  2. Install with pre-compiled kernels:
    VLLM_USE_PRECOMPILED=1 pip install --editable .
    
  3. Install transformers from main:
    uv pip install git+https://github.com/huggingface/transformers.git
    
    Ensure mistral_common >= 1.10.0 is installed:
    python -c "import mistral_common; print(mistral_common.__version__)"
    

Serve the Model

We recommend a server/client setup:

vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
  --gpu_memory_utilization 0.8

Ping the Server

<details> <summary>Instruction Following</summary>

Mistral Small 4 can follow your instructions to the letter.

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

assistant_message = response.choices[0].message.content
print(assistant_message)
</details> <details> <summary>Tool Call</summary>

Let's solve some equations using a simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # Demo only: eval() on untrusted input is unsafe outside of examples.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    tools=tools,
    tool_choice="auto",
    reasoning_effort="none",
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

print(response.choices[0].message.content)
</details> <details> <summary>Vision Reasoning</summary>

Let's see if Mistral Small 4 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="high",
)

print(response.choices[0].message.content)
</details>

Transformers

Installation

You need to install the main branch of Transformers to use Mistral Small 4:

uv pip install git+https://github.com/huggingface/transformers.git

Inference

Note: The current Transformers implementation does not support FP8. The weights are stored in FP8, and support for loading them in this format is expected; in the meantime, we provide BF16 dequantization snippets to ease usage. As soon as support is added, we will update the following code snippet.

<details> <summary>Python Inference Snippet</summary>
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from tqdm import tqdm

from transformers import AutoConfig, AutoProcessor, Mistral3ForConditionalGeneration


def _descale_fp8_to_bf16(tensor: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return tensor.to(torch.bfloat16) * scale_inv.to(torch.bfloat16)


def _resolve_model_dir(model_id: str) -> Path:
    local = Path(model_id)
    if local.is_dir():
        return local
    return Path(snapshot_download(model_id, allow_patterns=["model*.safetensors"]))


def load_and_dequantize_state_dict(model_id: str) -> dict[str, torch.Tensor]:
    model_dir = _resolve_model_dir(model_id)

    shards = sorted(model_dir.glob("model*.safetensors"))

    full_state_dict: dict[str, torch.Tensor] = {}
    for shard in tqdm(shards, desc="Loading safetensors shards"):
        full_state_dict.update(load_file(str(shard)))

    scale_suffixes = ("weight_scale_inv", "gate_up_proj_scale_inv", "down_proj_scale_inv", "up_proj_scale_inv")
    activation_scale_suffixes = ("activation_scale", "gate_up_proj_activation_scale", "down_proj_activation_scale")

    keys_to_remove: set[str] = set()
    all_keys = list(full_state_dict.keys())

    for key in tqdm(all_keys, desc="Dequantizing FP8 weights to BF16"):
        if any(key.endswith(s) for s in scale_suffixes + activation_scale_suffixes):
            continue

        for scale_suffix in scale_suffixes:
            if scale_suffix == "weight_scale_inv":
                if not key.endswith(".weight"):
                    continue
                scale_key = key.rsplit(".weight", 1)[0] + ".weight_scale_inv"
            else:
                proj_name = scale_suffix.replace("_scale_inv", "")
                if not key.endswith(f".{proj_name}"):
                    continue
                scale_key = key + "_scale_inv"

            if scale_key in full_state_dict:
                full_state_dict[key] = _descale_fp8_to_bf16(full_state_dict[key], full_state_dict[scale_key])
                keys_to_remove.add(scale_key)

    for key in full_state_dict:
        if any(key.endswith(s) for s in activation_scale_suffixes):
            keys_to_remove.add(key)

    for key in tqdm(keys_to_remove, desc="Removing scale keys"):
        del full_state_dict[key]

    return full_state_dict


def load_config_without_quantization(model_id: str) -> AutoConfig:
    config = AutoConfig.from_pretrained(model_id)

    if hasattr(config, "quantization_config"):
        del config.quantization_config

    if hasattr(config, "text_config") and hasattr(config.text_config, "quantization_config"):
        del config.text_config.quantization_config

    return config


model_id = "mistralai/Mistral-Small-4-119B-2603"

config = load_config_without_quantization(model_id)
state_dict = load_and_dequantize_state_dict(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    None,
    config=config,
    state_dict=state_dict,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort="high"
)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]) :], skip_special_tokens=False)
print(decoded_output)
</details>

License

This model is licensed under the Apache 2.0 License.

You must not use this model in a manner that infringes, misappropriates, or violates any third party’s rights, including intellectual property rights.

Author: unsloth

Likes: 23

Downloads: 0

Tags: gguf, vLLM, unsloth, ar, en, fr, es, de, it, pt, nl, ja, ko, zh, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

herimor/voxtream2


license: cc-by-4.0
datasets:

  • amphion/Emilia-Dataset
  • nvidia/hifitts-2

language:

  • en

pipeline_tag: text-to-speech
tags:

  • text-to-speech

Model Card for VoXtream2

VoXtream2 is a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly.

Key features

  • Dynamic speed control: Distribution matching and classifier-free guidance allow fine-grained control of the speaking rate, which can be adjusted while the model generates speech.
  • Streaming performance: Runs 4x faster than real time and achieves 74 ms first-packet latency in full-stream mode on a consumer GPU.
  • Translingual capability: Prompt text masking enables support for acoustic prompts in any language.

Model Sources

Get started

Installation

eSpeak NG phonemizer

# For Debian-like distribution (e.g. Ubuntu, Mint, etc.)
apt-get install espeak-ng
# For RedHat-like distribution (e.g. CentOS, Fedora, etc.) 
yum install espeak-ng
# For MacOS
brew install espeak-ng

Pip package

pip install "voxtream>=0.2"

Usage

  • Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
  • Text: What you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
  • Speaking rate (optional): target speaking rate in syllables per second.

Output streaming

voxtream \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"

Full streaming (slow speech, 2 syllables per second)

voxtream \
    --prompt-audio assets/audio/english_female.wav \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream_2sps.wav" \
    --full-stream \
    --spk-rate 2.0
  • Note: The initial run may take some time to download the model weights and warm up the model graph.

Out-of-Scope Use

Any organization or individual is prohibited from using any technology described here to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may place you in violation of copyright laws.

Training Data

The model was trained on the Emilia and HiFiTTS-2 datasets. You can download the preprocessed dataset here. For more details, please check our paper.

Citation

@inproceedings{torgashov2026voxtream,
  title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  note={to appear},
  url={https://arxiv.org/abs/2509.15969}
}

@article{torgashov2026voxtream2,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
  journal   = {arXiv:2603.13518},
  year      = {2026}
}

Author: herimor

Likes: 4

Downloads: 0

Tags: safetensors, text-to-speech, en, dataset:amphion/Emilia-Dataset, dataset:nvidia/hifitts-2, arxiv:2603.13518, arxiv:2509.15969, license:cc-by-4.0, region:us

AesSedai/Mistral-Small-4-119B-2603-GGUF


base_model:

  • mistralai/Mistral-Small-4-119B-2603

Description

This repo contains specialized MoE quants for Mistral-Small-4-119B-2603. The idea is that, given the huge size of the FFN tensors relative to the rest of the tensors in the model, it should be possible to achieve better quality at a smaller overall model size than with a comparable naive quantization. To that end, the default quantization type is kept at high quality while the FFN UP and FFN GATE tensors, along with the FFN DOWN tensors, are quantized more aggressively.
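
As a back-of-the-envelope illustration of that trade-off, the sketch below shows how the overall bits-per-weight shifts when only the FFN tensors are quantized harder. The 90/10 FFN split and the per-type bit widths are assumptions for illustration, not figures from this repo:

```python
# All numbers below are illustrative assumptions, not measurements from
# this repo: an assumed 90/10 FFN/non-FFN weight split and approximate
# bits-per-weight for each quant type.
ffn_fraction = 0.90
bpw_q8_0 = 8.5     # non-FFN tensors kept at high quality
bpw_iq4_xs = 4.25  # FFN tensors quantized harder

# Overall size is a weighted average, so the FFN quant dominates.
overall_bpw = (1 - ffn_fraction) * bpw_q8_0 + ffn_fraction * bpw_iq4_xs
print(f"{overall_bpw:.3f} bits per weight overall")
```

Because the FFN fraction is so large, pushing it down one quant tier saves far more than keeping the small non-FFN tensors at Q8_0 costs.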

Notes

I had made a Q4_K_M mix, but it kept returning NaNs during the KLD/PPL testing, so I'm looking into that.

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| :--------- | :--------- | :------- | :------- | :------- | :------- |
| Q5_K_M | 82.06 GiB (5.92 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 5.773442 ± 0.037218 | +0.3924% | 0.054105 ± 0.000366 |
| IQ4_XS | 53.09 GiB (3.83 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 5.980665 ± 0.038900 | +3.9957% | 0.132002 ± 0.000692 |
| IQ3_S | 40.89 GiB (2.95 BPW) | Q8_0 / IQ2_S / IQ2_S / IQ3_S | 6.506233 ± 0.043705 | +13.1347% | 0.261355 ± 0.001217 |
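
As a sanity check, the delta column is consistent with PPL(Q)/PPL(base) - 1 (note the sign relative to the header's formula); backing out the base perplexity from the Q5_K_M row reproduces the other rows:

```python
# Back out the base (unquantized) perplexity from the Q5_K_M row,
# then check the remaining rows against the reported deltas.
ppl_base = 5.773442 / 1.003924  # Q5_K_M: PPL 5.773442 at +0.3924%

for ppl_q, reported_pct in [(5.980665, 3.9957), (6.506233, 13.1347)]:
    delta_pct = (ppl_q / ppl_base - 1) * 100
    print(f"computed {delta_pct:.4f}% vs reported {reported_pct}%")
```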


Author: AesSedai

Likes: 4

Downloads: 0

Tags: gguf, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, endpoints_compatible, region:us, imatrix, conversational

facebook/test

Author: facebook

Likes: 3

Downloads: 0

Tags: meta-ai, facebook, meta-pytorch, en, license:fair-noncommercial-research-license, region:us

unsloth/Mistral-Small-4-119B-2603


base_model:

  • mistralai/Mistral-Small-4-119B-2603

license: apache-2.0
language:

  • ar
  • en
  • fr
  • es
  • de
  • it
  • pt
  • nl
  • ja
  • ko
  • zh

tags:

  • vLLM
  • unsloth

[!NOTE] Includes Unsloth chat template fixes! <br> For llama.cpp, use --jinja

<div> <p style="margin-top: 0;margin-bottom: 0;"> <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em> </p> <div style="display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> </div>

Mistral Small 4 119B A6B

Mistral Small 4 is a powerful hybrid model capable of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three model families into a single model: Instruct, Reasoning (previously called Magistral), and Devstral.

With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a powerful general-purpose model for any task. In a latency-optimized setup, Mistral Small 4 achieves a 40% reduction in end-to-end completion time, and in a throughput-optimized setup, it handles 3x more requests per second compared to Mistral Small 3.

Key Features

Mistral Small 4 includes the following architectural choices:

  • MoE: 128 experts, 4 active.
  • 119B parameters, with 6.5B activated per token.
  • 256k context length.
  • Multimodal input: Accepts both text and image input, with text output.
  • Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).

Mistral Small 4 offers the following capabilities:

  • Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
  • Vision: Analyzes images and provides insights based on visual content, in addition to text.
  • Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
  • System Prompt: Strong adherence and support for system prompts.
  • Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
  • Speed-Optimized: Delivers best-in-class performance and speed.
  • Apache 2.0 License: Open-source license for both commercial and non-commercial use.
  • Large Context Window: Supports a 256k context window.

Use Cases

Mistral Small 4 is designed for general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode toggled). Its multimodal capabilities also enable document and image understanding for data extraction and analysis.

Its capabilities are ideal for:

  • Developers interested in coding and agentic capabilities for SWE automation and codebase exploration.
  • Enterprises seeking general chat assistants, agents, and document understanding.
  • Researchers leveraging its math and research capabilities.

Mistral Small 4 is also well-suited for customization and fine-tuning for more specialized tasks.

Examples

  • General chat assistant
  • Document parsing and extraction
  • Coding agent
  • Research assistant
  • Customization & fine-tuning
  • And more...

Benchmarks

Comparison with internal models

Depending on your task, you can trigger reasoning via the per-request parameter reasoning_effort.

Internal benchmark

Comparing Reasoning Models

Internal benchmark - Reasoning

Comparison with other models

Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B across all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores 0.72 with just 1.6K characters, whereas Qwen models require 3.5-4x more output (5.8-6.1K characters) for comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing 20% less output. This efficiency reduces latency and inference costs and improves the user experience.

(Comparison benchmark figures: AA LCR, LiveCodeBench, AIME25.)

Usage

You can find Mistral Small 4 support in multiple libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this happen.

Inference

The model can be deployed with:

For optimal performance, we recommend using the Mistral AI API if local serving is subpar.

Fine-Tuning

Fine-tune the model via:

vLLM (Recommended)

We recommend using Mistral Small 4 with the vLLM library for production-ready inference.

Installation

[!Tip] Use our custom Docker image with fixes for tool calling and reasoning parsing in vLLM, and the latest Transformers version. We are working with the vLLM team to merge these fixes soon.

Custom Docker

Use the following Docker image: mistralllm/vllm-ms4:latest:

docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest

Manual Install

Alternatively, install vLLM from this PR: Add Mistral Guidance.

Note: This PR is expected to be merged into vLLM main within the next 1-2 weeks (as of 16.03.2026). Track updates here.

  1. Clone vLLM:
    git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
    
  2. Install with pre-compiled kernels:
    VLLM_USE_PRECOMPILED=1 pip install --editable .
    
  3. Install transformers from main:
    uv pip install git+https://github.com/huggingface/transformers.git
    
    Ensure mistral_common >= 1.10.0 is installed:
    python -c "import mistral_common; print(mistral_common.__version__)"
    

Serve the Model

We recommend a server/client setup:

vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
  --gpu_memory_utilization 0.8

Ping the Server

<details> <summary>Instruction Following</summary>

Mistral Small 4 can follow your instructions to the letter.

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

assistant_message = response.choices[0].message.content
print(assistant_message)
</details> <details> <summary>Tool Call</summary>

Let's solve some equations using our simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # Note: eval is unsafe on untrusted input; acceptable only for this local demo.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    tools=tools,
    tool_choice="auto",
    reasoning_effort="none",
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

print(response.choices[0].message.content)
</details> <details> <summary>Vision Reasoning</summary>

Let's see if Mistral Small 4 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="high",
)

print(response.choices[0].message.content)
</details>

Transformers

Installation

You need to install the main branch of Transformers to use Mistral Small 4:

uv pip install git+https://github.com/huggingface/transformers.git

Inference

Note: The current Transformers implementation does not support FP8. The weights are stored in FP8, and support for loading them in this format is expected; in the meantime, we provide a BF16 dequantization snippet to ease usage. We will update the code snippet below as soon as support is added.

<details> <summary>Python Inference Snippet</summary>
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from tqdm import tqdm

from transformers import AutoConfig, AutoProcessor, Mistral3ForConditionalGeneration


def _descale_fp8_to_bf16(tensor: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return tensor.to(torch.bfloat16) * scale_inv.to(torch.bfloat16)


def _resolve_model_dir(model_id: str) -> Path:
    local = Path(model_id)
    if local.is_dir():
        return local
    return Path(snapshot_download(model_id, allow_patterns=["model*.safetensors"]))


def load_and_dequantize_state_dict(model_id: str) -> dict[str, torch.Tensor]:
    model_dir = _resolve_model_dir(model_id)

    shards = sorted(model_dir.glob("model*.safetensors"))

    full_state_dict: dict[str, torch.Tensor] = {}
    for shard in tqdm(shards, desc="Loading safetensors shards"):
        full_state_dict.update(load_file(str(shard)))

    scale_suffixes = ("weight_scale_inv", "gate_up_proj_scale_inv", "down_proj_scale_inv", "up_proj_scale_inv")
    activation_scale_suffixes = ("activation_scale", "gate_up_proj_activation_scale", "down_proj_activation_scale")

    keys_to_remove: set[str] = set()
    all_keys = list(full_state_dict.keys())

    for key in tqdm(all_keys, desc="Dequantizing FP8 weights to BF16"):
        if any(key.endswith(s) for s in scale_suffixes + activation_scale_suffixes):
            continue

        for scale_suffix in scale_suffixes:
            if scale_suffix == "weight_scale_inv":
                if not key.endswith(".weight"):
                    continue
                scale_key = key.rsplit(".weight", 1)[0] + ".weight_scale_inv"
            else:
                proj_name = scale_suffix.replace("_scale_inv", "")
                if not key.endswith(f".{proj_name}"):
                    continue
                scale_key = key + "_scale_inv"

            if scale_key in full_state_dict:
                full_state_dict[key] = _descale_fp8_to_bf16(full_state_dict[key], full_state_dict[scale_key])
                keys_to_remove.add(scale_key)

    for key in full_state_dict:
        if any(key.endswith(s) for s in activation_scale_suffixes):
            keys_to_remove.add(key)

    for key in tqdm(keys_to_remove, desc="Removing scale keys"):
        del full_state_dict[key]

    return full_state_dict


def load_config_without_quantization(model_id: str) -> AutoConfig:
    config = AutoConfig.from_pretrained(model_id)

    if hasattr(config, "quantization_config"):
        del config.quantization_config

    if hasattr(config, "text_config") and hasattr(config.text_config, "quantization_config"):
        del config.text_config.quantization_config

    return config


model_id = "mistralai/Mistral-Small-4-119B-2603"

config = load_config_without_quantization(model_id)
state_dict = load_and_dequantize_state_dict(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    None,
    config=config,
    state_dict=state_dict,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort="high"
)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]) :], skip_special_tokens=False)
print(decoded_output)
</details>

License

This model is licensed under the Apache 2.0 License.

You must not use this model in a manner that infringes, misappropriates, or violates any third party’s rights, including intellectual property rights.

Author: unsloth

Likes: 3

Downloads: 24

Tags: safetensors, mistral3, vLLM, unsloth, ar, en, fr, es, de, it, pt, nl, ja, ko, zh, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, license:apache-2.0, fp8, region:us

mmnga-o/RakutenAI-3.0-gguf


license: unknown
language:

  • ja

datasets:

  • TFMC/imatrix-dataset-for-japanese-llm

base_model:

  • Rakuten/RakutenAI-3.0

RakutenAI-3.0-gguf

This is a GGUF-format conversion of RakutenAI-3.0, published by Rakuten.

The imatrix data was created using TFMC/imatrix-dataset-for-japanese-llm.

Usage

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
build/bin/llama-cli -m 'RakutenAI-3.0-gguf' -n 128 -c 128 -p 'あなたはプロの料理人です。レシピを教えて' -cnv

Author: mmnga-o

Likes: 2

Downloads: 0

Tags: gguf, ja, dataset:TFMC/imatrix-dataset-for-japanese-llm, base_model:Rakuten/RakutenAI-3.0, base_model:quantized:Rakuten/RakutenAI-3.0, license:unknown, endpoints_compatible, region:us, conversational

meangrinch/Qwen3.5-35B-A3B-heretic-v3-GGUF


base_model: meangrinch/Qwen3.5-35B-A3B-heretic-v3
tags:

  • gguf
  • heretic
  • abliteration
  • uncensored

license: apache-2.0
language:

  • en

Qwen3.5-35B-A3B-heretic-v3-GGUF

GGUF quantizations of the abliterated weights.

Files

| Quant | imatrix | UD map |
|---|---|---|
| Q8_0 | No | No |
| Q6_K_XL | Yes | Yes |
| Q4_K_XL | Yes | Yes |
| IQ4_NL | Yes | Yes |
| IQ4_XS | Yes | Yes |

Source

Related

Author: meangrinch

Likes: 2

Downloads: 0

Tags: gguf, heretic, abliteration, uncensored, en, base_model:meangrinch/Qwen3.5-35B-A3B-heretic-v3, base_model:quantized:meangrinch/Qwen3.5-35B-A3B-heretic-v3, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

Intel/MiroThinker-1.7-int4-AutoRound


base_model:

  • miromind-ai/MiroThinker-1.7

pipeline_tag: text-generation

Model Details

This model is an int4 quantization of miromind-ai/MiroThinker-1.7 with group_size 128 and symmetric quantization, generated by intel/auto-round. Please follow the license of the original model.
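
To illustrate what symmetric group-wise int4 quantization means, here is a generic sketch; AutoRound itself additionally optimizes the rounding via signed gradient descent, per the citation below, so this is not its actual implementation:

```python
def quantize_sym_int4(group):
    """Symmetric int4 quantization of one group of weights.

    One scale per group (e.g. 128 weights per group in this model);
    symmetric means there is no zero-point: q = round(w / scale).
    Assumes the group is not all zeros.
    """
    scale = max(abs(w) for w in group) / 7.0  # map the largest magnitude to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in group]  # clamp to int4 range
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

group = [0.62, -1.3, 0.11, 1.38]  # a toy 4-element group for readability
q, scale = quantize_sym_int4(group)
print(q, [round(v, 3) for v in dequantize(q, scale)])
```

Each group stores only int4 codes plus one scale, which is where the 4-bit size reduction comes from.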

How to Use

VLLM Usage

vllm serve Intel/MiroThinker-1.7-int4-AutoRound \
    --host localhost \
    --dtype bfloat16

Generate the Model

auto-round --model_name miromind-ai/MiroThinker-1.7 --bits 4 --iters 200 --output_dir MiroThinker-1.7-int4-AutoRound

Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs. Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Here are a couple of useful links to learn more about Intel's AI software:

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

arxiv github

Author: Intel

Likes: 2

Downloads: 9

Tags: safetensors, qwen3_moe, text-generation, conversational, arxiv:2309.05516, base_model:miromind-ai/MiroThinker-1.7, base_model:quantized:miromind-ai/MiroThinker-1.7, 4-bit, auto-round, region:us

jackcloudman/Leanstral-2603-GGUF


license: apache-2.0
base_model: mistralai/Leanstral-2603
tags:

  • gguf
  • llama-cpp
  • mistral
  • moe
  • lean4
  • math
  • deepseek2

quantized_by: jackcloudman
model_type: deepseek2

Leanstral 119B A6B - GGUF

GGUF quantizations of mistralai/Leanstral-2603 for use with llama.cpp.

Leanstral is the first open-source code agent designed for Lean 4, a proof assistant for formal mathematics and software verification. Built as part of the Mistral Small 4 family, it combines multimodal capabilities with an efficient MoE + MLA architecture.

Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| mistralai_Leanstral-128x3.9B-2603-Q4_K_M.gguf | Q4_K_M | 68 GB | Best balance of quality and size. Runs on 2x RTX 4090 + RAM offload |
| mistralai_Leanstral-128x3.9B-2603-Q8_0.gguf | Q8_0 | 118 GB | Near-lossless. Good base for custom requantization |

Architecture

  • Type: Mixture of Experts (MoE) + Multi-head Latent Attention (MLA)
  • GGUF arch: deepseek2 (Mistral 4 uses the same architecture as DeepSeek V3)
  • Total parameters: 119B (6.5B active per token)
  • Experts: 128 routed + 1 shared, 4 active per token
  • MLA: q_lora_rank=1024, kv_lora_rank=256, qk_rope_head_dim=64
  • Context: Up to 1M tokens (256k recommended)
  • RoPE: YaRN scaling (factor=128, original_ctx=8192)
  • Vocab: 131,072 tokens (Tekken tokenizer)

How to Run

llama-server (recommended)

./llama-server \
  -m mistralai_Leanstral-128x3.9B-2603-Q4_K_M.gguf \
  -fit on -fa on \
  --host 0.0.0.0 \
  --ctx-size 128000 \
  --jinja \
  --chat-template-file chat_template.jinja

Note: You need a chat template that supports [THINK] blocks for reasoning. Download the template from the original model repo.

Reasoning

The model supports reasoning_effort via the chat template:

  • "high" - Enables thinking (recommended for Lean 4 proofs and complex tasks)
  • "none" - Direct answers without reasoning

Pass reasoning_effort in your API request body, or modify the chat template default.
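As a minimal sketch, such a request might look like the following, assuming llama-server's OpenAI-compatible chat endpoint on its default port (the host, port, and prompt are illustrative; the `reasoning_effort` field is consumed by the chat template, not by llama-server itself):

```shell
# Build a chat-completion request body that sets reasoning_effort.
PAYLOAD='{
  "messages": [
    {"role": "user", "content": "Prove that n + 0 = n in Lean 4."}
  ],
  "reasoning_effort": "high"
}'

# Send it to a locally running llama-server instance.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

With `"reasoning_effort": "high"`, the response should contain a [THINK] block before the final answer (provided the chat template from the original repo is loaded).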

Performance

On 2x RTX 4090 (48GB VRAM) + 192GB RAM with Q4_K_M:

  • ~34 tokens/s generation speed
  • Model splits between GPU and system RAM automatically with -fit on

Conversion Details

  • Source: FP8 (e4m3) consolidated weights from mistralai/Leanstral-2603
  • Pipeline: FP8 consolidated → dequant to BF16 → Q8_0 GGUF → Q4_K_M GGUF (with --allow-requantize)
  • Converter: convert_hf_to_gguf.py with --mistral-format flag (llama.cpp)
  • Tokenizer: Tekken v15 (requires mistral-common >= 1.10.0 for conversion)
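Under the assumptions above, the pipeline roughly corresponds to the following commands (a sketch only: the local paths are illustrative, and `convert_hf_to_gguf.py` and `llama-quantize` are the conversion and quantization tools shipped with llama.cpp):

```shell
# 1. Convert the FP8 consolidated checkpoint to a Q8_0 GGUF.
#    convert_hf_to_gguf.py dequantizes to BF16 internally before quantizing.
python convert_hf_to_gguf.py ./Leanstral-2603 \
  --mistral-format \
  --outtype q8_0 \
  --outfile mistralai_Leanstral-128x3.9B-2603-Q8_0.gguf

# 2. Requantize the Q8_0 file down to Q4_K_M.
#    --allow-requantize permits quantizing from an already-quantized GGUF.
./llama-quantize --allow-requantize \
  mistralai_Leanstral-128x3.9B-2603-Q8_0.gguf \
  mistralai_Leanstral-128x3.9B-2603-Q4_K_M.gguf Q4_K_M
```

Requantizing from Q8_0 rather than from BF16 loses a little precision, which is why the Q8_0 file above is described as a good base for custom requantization.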

License

Apache 2.0 - same as the original model.

Credits

Author: jackcloudman

Likes: 2

Downloads: 5

Tags: gguf, llama-cpp, mistral, moe, lean4, math, deepseek2, base_model:mistralai/Leanstral-2603, base_model:quantized:mistralai/Leanstral-2603, license:apache-2.0, endpoints_compatible, region:us

MikeRoz/Anubis-70B-v1.2-exl3

Author: MikeRoz

Likes: 1

Downloads: 0

Tags: region:us