Today's AI Summary

AI Developments: Instruction-Based Video Editing and Enhanced LLM Reasoning

Today's AI landscape features advancements in video editing, language model reasoning, and more efficient reinforcement learning techniques. Here's a summary of the key developments:

Research Papers

  • EasyV2V: High-Quality Instruction-Based Video Editing: This paper introduces a framework for instruction-based video editing that addresses challenges in consistency, control, and generalization. EasyV2V leverages existing text-to-video models and simplifies the design through sequence concatenation and LoRA fine-tuning. It supports flexible inputs and achieves state-of-the-art results, surpassing concurrent and commercial systems.
  • DVGT: Driving Visual Geometry Transformer: This paper presents a model for reconstructing 3D scene geometry from visual inputs for autonomous driving. DVGT reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. The model is trained on a large mixture of driving datasets and outperforms existing models on various scenarios.
  • AuditDM: Auditing Models for Capability Gap Discovery and Rectification: This paper introduces an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks.
  • Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning: This paper introduces an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. The method delivers consistent gains over strong baselines with standard RL post-training across various mathematical benchmarks.
  • Exploration vs. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward: This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). The results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement.
  • Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning: This paper introduces Posterior Behavioral Cloning (PostBC), a method for pretraining policies that ensures coverage over the demonstrator's actions, enabling more effective finetuning. PostBC is practically implementable with modern generative models in robotic control domains and leads to significantly improved RL finetuning performance.
  • Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos: This paper presents the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction. It also introduces the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface.
  • Impacts of Racial Bias in Historical Training Data for News AI: This paper investigates racial bias in a multi-label classifier trained on the New York Times Annotated Corpus. The study reveals how embedded biases in a model can lead to unexpected outputs in newsroom settings.
  • LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation: This paper presents LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut achieves state-of-the-art results on standard benchmarks.
  • Training Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies: This paper reports a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75).

Models

  • beyoru/BronCode-Thinker: A transformer-based model for text generation and conversational AI, fine-tuned from Qwen/Qwen3-4B-Thinking-2507.
  • Comfy-Org/NewBie-image-Exp0.1_repackaged: A diffusion model for image generation, designed for use with ComfyUI.
  • NikolayKozloff/kona2-small-3.8B-Q8_0-GGUF: A GGUF format conversion of the tbilisi-ai-lab/kona2-small-3.8B model, optimized for use with llama.cpp.

AI Papers for 2026-02-04

Reward-free Alignment for Conflicting Objectives

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
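
As a rough illustration of the gradient-conflict idea (a minimal numpy sketch, not RACO's actual update rule; the combination weights and the clipping rule below are assumptions for illustration):

# Illustrative sketch: resolve a two-objective gradient conflict by removing
# each gradient's component that opposes the other, then clip the correction.
import numpy as np

def combine_gradients(g1, g2, weights=(0.5, 0.5), clip=1.0):
    """Return an update direction that avoids conflict between g1 and g2."""
    g = weights[0] * g1 + weights[1] * g2            # naive weighted direction
    if np.dot(g1, g2) < 0:                           # gradients conflict
        # remove the component of g1 opposing g2, and vice versa
        c1 = -np.dot(g1, g2) / (np.dot(g2, g2) + 1e-12) * g2
        c2 = -np.dot(g2, g1) / (np.dot(g1, g1) + 1e-12) * g1
        correction = weights[0] * c1 + weights[1] * c2
        # clip the correction so it cannot dominate the weighted direction
        norm = np.linalg.norm(correction)
        if norm > clip:
            correction = correction * (clip / norm)
        g = g + correction
    return g

# toy example: f1(x) = x0**2 and f2(x) = (x0 - 2)**2 + x1**2 pull x0 in opposite directions
x = np.array([1.0, 1.0])
g1 = np.array([2 * x[0], 0.0])
g2 = np.array([2 * (x[0] - 2), 2 * x[1]])
print(combine_gradients(g1, g2))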

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
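
For intuition, here is a minimal PyTorch sketch of combining a pixel-space reconstruction loss with local and global perceptual terms; the tiny conv "extractors" are placeholders standing in for LPIPS and DINO features so the snippet runs standalone, and the loss weights are illustrative, not the paper's values:

# Sketch only: pixel loss + local (LPIPS-like) + global (DINO-like) feature losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

local_feats = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(16, 16, 3, padding=1))           # stand-in for LPIPS features
global_feats = nn.Sequential(nn.Conv2d(3, 16, 8, stride=8), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())    # stand-in for a DINO embedding

def perceptual_pixel_loss(x_pred, x_target, w_pix=1.0, w_local=0.5, w_global=0.5):
    pix = F.mse_loss(x_pred, x_target)                                 # pixel-space term
    local = F.mse_loss(local_feats(x_pred), local_feats(x_target))     # local perceptual term
    glob = F.mse_loss(global_feats(x_pred), global_feats(x_target))    # global semantic term
    return w_pix * pix + w_local * local + w_global * glob

x_pred = torch.rand(2, 3, 64, 64, requires_grad=True)
x_target = torch.rand(2, 3, 64, 64)
loss = perceptual_pixel_loss(x_pred, x_target)
loss.backward()
print(loss.item())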

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.
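
A hedged sketch of the cross-trajectory loop (stubbed LLM calls; run_trajectory and summarize_state are hypothetical helpers, and the state fields are assumptions drawn from the summary above):

# Sketch: each round produces a structured state that conditions the next rollout.
def run_trajectory(question, prior_state):
    # one ReAct-style rollout conditioned on the previous structured state
    return {"answer": None, "tool_calls": 12, "notes": "searched two sources"}

def summarize_state(question, trajectory, prior_state):
    # compress the rollout into evidence / uncertainties / failures / next plans
    return {
        "evidence": ["source A partially supports the claim"],
        "uncertainties": ["publication date unclear"],
        "failures": ["site B blocked scraping"],
        "plan": ["query archive mirrors", "cross-check dates"],
    }

def cross_trajectory_search(question, max_rounds=3):
    state = None
    for _ in range(max_rounds):
        traj = run_trajectory(question, state)          # explore, conditioned on the state
        if traj["answer"] is not None:
            return traj["answer"]
        state = summarize_state(question, traj, state)  # reflect before the next round
    return state

print(cross_trajectory_search("Who founded the first public library in X?"))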

Flow Policy Gradients for Robot Control

Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients -- a recent framework that bypasses likelihood computation -- can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis on training dynamics. Results show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.

AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and releasing a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory-derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated, domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step by step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
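
A toy sketch of the constraint-log idea (the trajectory format and the constraints below are invented for illustration, not AgentRx's own):

# Sketch: evaluate simple constraints step by step and collect an auditable log.
trajectory = [
    {"step": 1, "action": "search_tickets", "output": "3 results"},
    {"step": 2, "action": "book_ticket", "output": "error: missing passenger id"},
    {"step": 3, "action": "confirm_booking", "output": "error: no booking found"},
]

constraints = [
    ("tool output must not be an error",
     lambda s: not s["output"].startswith("error")),
    ("action must come from the allowed tool set",
     lambda s: s["action"] in {"search_tickets", "book_ticket", "confirm_booking"}),
]

validation_log = []
for step in trajectory:
    for name, check in constraints:
        if not check(step):
            validation_log.append({"step": step["step"], "violated": name,
                                   "evidence": step["output"]})

# an LLM judge would consume the full log; here we just report the earliest violation
critical = min(validation_log, key=lambda v: v["step"]) if validation_log else None
print(critical)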

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
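
A hedged sketch of the controller/executor/designer loop (select_skills, execute_skill, and revise_skills are invented stand-ins for the learned controller, the LLM executor, and the skill designer):

# Sketch: skills are prompt-like routines; the loop selects, executes, and revises them.
skills = {
    "extract_entities": "List the people, places, and dates mentioned.",
    "consolidate": "Merge new facts with existing memory, removing duplicates.",
    "prune": "Drop memories that are stale or contradicted by newer evidence.",
}

def select_skills(interaction, skills):
    # controller: pick a small set of relevant skills (trivial heuristic here)
    return ["extract_entities", "consolidate"] if interaction else []

def execute_skill(name, interaction, memory):
    # executor: an LLM would apply the skill's routine; we just record it
    memory.append(f"[{name}] applied to: {interaction[:40]}")
    return memory

def revise_skills(hard_cases, skills):
    # designer: periodically propose refinements or new skills from failures
    if hard_cases:
        skills["track_preferences"] = "Record stated user preferences explicitly."
    return skills

memory, hard_cases = [], ["missed a stated preference"]
for interaction in ["User booked a flight to Kyoto for May 3.", "User prefers aisle seats."]:
    for name in select_skills(interaction, skills):
        memory = execute_skill(name, interaction, memory)

skills = revise_skills(hard_cases, skills)
print(memory, list(skills))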

Multi-head automated segmentation by incorporating detection head into the contextual layer neural network

Deep learning based auto segmentation is increasingly used in radiotherapy, but conventional models often produce anatomically implausible false positives, or hallucinations, in slices lacking target structures. We propose a gated multi-head Transformer architecture based on Swin U-Net, augmented with inter-slice context integration and a parallel detection head, which jointly performs slice-level structure detection via a multi-layer perceptron and pixel-level segmentation through a context-enhanced stream. Detection outputs gate the segmentation predictions to suppress false positives in anatomically invalid slices, and training uses slice-wise Tversky loss to address class imbalance. Experiments on the Prostate-Anatomical-Edge-Cases dataset from The Cancer Imaging Archive demonstrate that the gated model substantially outperforms a non-gated segmentation-only baseline, achieving a mean Dice loss of $0.013 \pm 0.036$ versus $0.732 \pm 0.314$, with detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations. In contrast, the non-gated model exhibited higher variability and persistent false positives across all slices. These results indicate that detection-based gating enhances robustness and anatomical plausibility in automated segmentation applications, reducing hallucinated predictions without compromising segmentation quality in valid slices, and offers a promising approach for improving the reliability of clinical radiotherapy auto-contouring workflows.
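
A minimal PyTorch sketch of the two ingredients described above, detection-gated segmentation and a slice-wise Tversky loss; the shapes, gating threshold, and alpha/beta weights are illustrative assumptions, not the paper's exact configuration:

# Sketch: a slice-level detection probability gates the pixel-level segmentation map.
import torch
import torch.nn.functional as F

def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-6):
    # pred, target: (B, H, W); alpha weights false negatives more than false positives
    tp = (pred * target).sum(dim=(1, 2))
    fp = (pred * (1 - target)).sum(dim=(1, 2))
    fn = ((1 - pred) * target).sum(dim=(1, 2))
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1 - tversky).mean()

# toy outputs from a segmentation head and a slice-level detection head
seg_logits = torch.randn(4, 128, 128)                  # per-pixel logits for 4 slices
det_logits = torch.randn(4)                            # per-slice "structure present" logits
target = (torch.rand(4, 128, 128) > 0.9).float()

seg_prob = torch.sigmoid(seg_logits)
det_prob = torch.sigmoid(det_logits)

# gate: suppress segmentation on slices the detector deems structure-free
gated = seg_prob * (det_prob > 0.5).float().view(-1, 1, 1)

loss = tversky_loss(seg_prob, target) + F.binary_cross_entropy(det_prob, target.amax(dim=(1, 2)))
print(gated.shape, loss.item())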

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
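
A tiny sketch of the data recipe as described: mix forward-knowledge examples with identity examples of the form "A → A" (the templates and mixing are illustrative assumptions):

# Sketch: build fine-tuning text that pairs each forward fact with identity statements.
forward_facts = [("Alice", "husband", "Bob"), ("Carol", "mentor", "Dan")]

def make_training_examples(facts, add_identity_bridge=True):
    examples = []
    for a, relation, b in facts:
        examples.append(f"{a}'s {relation} is {b}.")        # forward knowledge A -> B
        if add_identity_bridge:
            examples.append(f"The name of {a} is {a}.")      # identity bridge A -> A
            examples.append(f"The name of {b} is {b}.")
    return examples

for line in make_training_examples(forward_facts):
    print(line)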

Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

AI Models

unsloth/Qwen3-Coder-Next-GGUF


tags: qwen3_next, unsloth, qwen, qwen3
base_model: Qwen/Qwen3-Coder-Next
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation

To run Qwen3-Coder-Next locally, read our guide: https://unsloth.ai/docs/models/qwen3-coder-next

Unsloth Dynamic 2.0 (https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) achieves superior accuracy and outperforms other leading quants.

Qwen3-Coder-Next Usage Guidelines
  • It is recommended to have >45GB of unified memory or combined RAM/VRAM to run 4-bit quants.
  • For best results, use any 2-bit XL quant or above (requires >30GB unified memory or combined RAM + VRAM); a download sketch follows this list.
  • For complete, detailed instructions (sampling parameters, etc.), see our guide: docs.unsloth.ai/models/qwen3-coder-next
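
One way to fetch a single quant locally (a sketch using huggingface_hub; the "*Q4_K_M*" file pattern is an assumption, so check the repo's file list for the exact quant names):

# Download only the files matching the chosen quant from the GGUF repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-Coder-Next-GGUF",
    allow_patterns=["*Q4_K_M*"],          # pull only the chosen quant's shards
    local_dir="Qwen3-Coder-Next-GGUF",
)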

Qwen3-Coder-Next

Highlights

Today, we're announcing Qwen3-Coder-Next, an open-weight language model designed specifically for coding agents and local development. It features the following key enhancements:

  • Super Efficient with Significant Performance: With only 3B activated parameters (80B total parameters), it achieves performance comparable to models with 10–20x more active parameters, making it highly cost-effective for agent deployment.
  • Advanced Agentic Capabilities: Through an elaborate training recipe, it excels at long-horizon reasoning, complex tool usage, and recovery from execution failures, ensuring robust performance in dynamic coding tasks.
  • Versatile Integration with Real-World IDE: Its 256k context length, combined with adaptability to various scaffold templates, enables seamless integration with different CLI/IDE platforms (e.g., Claude Code, Qwen Code, Qoder, Kilo, Trae, Cline, etc.), supporting diverse development environments.


Model Overview

Qwen3-Coder-Next has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 80B in total and 3B activated
  • Number of Parameters (Non-Embedding): 79B
  • Hidden Dimension: 2048
  • Number of Layers: 48
    • Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
  • Gated Attention:
    • Number of Attention Heads: 16 for Q and 2 for KV
    • Head Dimension: 256
    • Rotary Position Embedding Dimension: 64
  • Gated DeltaNet:
    • Number of Linear Attention Heads: 32 for V and 16 for QK
    • Head Dimension: 128
  • Mixture of Experts:
    • Number of Experts: 512
    • Number of Activated Experts: 10
    • Number of Shared Experts: 1
    • Expert Intermediate Dimension: 512
  • Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Additionally, specifying enable_thinking=False is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Quickstart

We advise you to use the latest version of transformers.

The following contains a code snippet illustrating how to use the model to generate content based on the given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-Next"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype="auto",
  device_map="auto"
)

# prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
  {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)
# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32,768.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Deployment

For deployment, you can use the latest sglang or vllm to create an OpenAI-compatible API endpoint.

SGLang

SGLang is a fast serving framework for large language models and vision-language models. SGLang can be used to launch a server with an OpenAI-compatible API service.

sglang>=v0.5.8 is required for Qwen3-Coder-Next, which can be installed using:

pip install 'sglang[all]>=v0.5.8'

See its documentation for more details.

The following command can be used to create an API endpoint at http://localhost:30000/v1 with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs.

python -m sglang.launch_server --model Qwen/Qwen3-Coder-Next --port 30000 --tp-size 2 --tool-call-parser qwen3_coder

Note: The default context length is 256K. Consider reducing the context length to a smaller value, e.g., 32,768, if the server fails to start.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM can be used to launch a server with an OpenAI-compatible API service.

vllm>=0.15.0 is required for Qwen3-Coder-Next, which can be installed using:

pip install 'vllm>=0.15.0'

See its documentation for more details.

The following command can be used to create an API endpoint at http://localhost:8000/v1 with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs.

vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder

Note: The default context length is 256K. Consider reducing the context length to a smaller value, e.g., 32,768, if the server fails to start.

Agentic Coding

Qwen3-Coder-Next excels at tool calling.

You can define or use any tools as in the following example.

# Your tool implementation (parameter name matches "input_num" in the tool schema below)
def square_the_number(input_num: float) -> float:
    return input_num ** 2

# Define Tools
tools=[
    {
        "type":"function",
        "function":{
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    'input_num': {
                        'type': 'number', 
                        'description': 'input_num is a number that will be squared'
                        }
                },
            }
        }
    }
]

from openai import OpenAI
# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url='http://localhost:8000/v1',  # api_base
    api_key="EMPTY"
)
 
messages = [{'role': 'user', 'content': 'square the number 1024'}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-Next",
    max_tokens=65536,
    tools=tools,
)

print(completion.choices[0])
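
A possible continuation of the snippet above (a sketch, not part of the original example): execute the requested tool locally and send its result back so the model can produce a final answer.

import json

message = completion.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = square_the_number(args["input_num"])   # run the local tool

    messages.append(message)                        # the assistant's tool-call turn
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": str(result),
    })

    followup = client.chat.completions.create(
        messages=messages,
        model="Qwen3-Coder-Next",
        max_tokens=1024,
        tools=tools,
    )
    print(followup.choices[0].message.content)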

Best Practices

To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.
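
For example, with the Quickstart setup above (a sketch reusing model and model_inputs; note that with an OpenAI-compatible server, temperature and top_p go in the request body, while top_k is typically passed via extra_body):

# Apply the recommended sampling parameters when generating with transformers.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=40,
)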

Citation

If you find our work helpful, feel free to cite us.

@techreport{qwen_qwen3_coder_next_tech_report,
  title        = {Qwen3-Coder-Next Technical Report},
  author       = {{Qwen Team}},
  url          = {https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf},
  note         = {Accessed: 2026-02-03}
}

Author: unsloth

Likes: 61

Downloads: 0

Tags: transformers, gguf, qwen3_next, unsloth, qwen, qwen3, text-generation, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

openbmb/MiniCPM-o-4_5


license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
tags: minicpm-o, minicpm-v, multimodal, full-duplex

A Gemini 2.5 Flash-Level MLLM for Vision, Speech, and Full-Duplex Multimodal Live Streaming on Your Phone

GitHub | CookBook | Streaming Demo | Chatbot Demo

MiniCPM-o 4.5

MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B with a total of 9B parameters. It exhibits a significant performance improvement, and introduces new features for full-duplex multimodal live streaming. Notable features of MiniCPM-o 4.5 include:

  • 🔥 Leading Visual Capability. MiniCPM-o 4.5 achieves an average score of 77.6 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 9B parameters, it surpasses widely used proprietary models such as GPT-4o and Gemini 2.0 Pro, and approaches Gemini 2.5 Flash in vision-language capabilities. It supports instruct and thinking modes in a single model, better covering efficiency and performance trade-offs in different user scenarios.

  • 🎙 Strong Speech Capability. MiniCPM-o 4.5 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It features more natural, expressive and stable speech conversation. The model also allows for fun features such as voice cloning and role play via a simple reference audio clip, where the cloning performance surpasses strong TTS tools such as CosyVoice2.

  • 🎬 New Full-Duplex and Proactive Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This allows MiniCPM-o 4.5 to see, listen, and speak simultaneously, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform proactive interaction, such as initiating reminders or comments based on its continuous understanding of the live scene.

  • 💪 Strong OCR Capability, Efficiency and Others. Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 4.5 can process high-resolution images (up to 1.8 million pixels) and high-FPS videos (up to 10 fps) in any aspect ratio efficiently. It achieves state-of-the-art performance for end-to-end English document parsing on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features trustworthy behaviors, matching Gemini 2.5 Flash on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.

  • 💫 Easy Usage. MiniCPM-o 4.5 can be easily used in various ways: (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput and memory-efficient inference, (4) FlagOS support for the unified multi-chip backend plugin, (5) fine-tuning on new domains and tasks with LLaMA-Factory, and (6) a server-hosted online web demo. We also roll out a high-performing llama.cpp-omni inference framework together with a WebRTC demo, which enables the full-duplex multimodal live streaming experience on local devices such as PCs (e.g., a MacBook).

Model Architecture

Framework diagram: https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-o-45-framework.png

Evaluation

<div align="center"> <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar_minicpmo4.5.png", width=80%> </div> <div align="center"> <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm_o_45_main_exp_table.png", width=90%> </div> Note: Scores marked with ∗ are from our evaluation; others are cited from referenced reports. n/a indicates that the model does not support the corresponding modality. All results are reported in instruct mode/variant. <details> <summary>Click to view visual understanding results.</summary>

Image Understanding (Instruct)

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>OpenCompass</b></th> <th nowrap="nowrap"><b>MMBench EN v1.1</b></th> <th nowrap="nowrap"><b>MMBench CN v1.1</b></th> <th nowrap="nowrap"><b>MathVista</b></th> <th nowrap="nowrap"><b>MMVet</b></th> <th nowrap="nowrap"><b>MMMU</b></th> <th nowrap="nowrap"><b>MMStar</b></th> <th nowrap="nowrap"><b>HallusionBench</b></th> <th nowrap="nowrap"><b>AI2D</b></th> <th nowrap="nowrap"><b>OCRBench</b></th> <th nowrap="nowrap"><b>TextVQA_VAL</b></th> <th nowrap="nowrap"><b>DocVQA_VAL</b></th> <th nowrap="nowrap"><b>MMT-Bench_VAL</b></th> <th nowrap="nowrap"><b>MM-IFEval</b></th> <th nowrap="nowrap"><b>Mantis-Eval</b></th> <th nowrap="nowrap"><b>MuirBench</b></th> <th nowrap="nowrap"><b>MMSI-Bench</b></th> <th nowrap="nowrap"><b>MMHal-Score</b></th> <th nowrap="nowrap"><b>MMHal-Hallrate↓</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Gemini2.5-Flash-Nonthinking</td> <td align="center"><b>78.5</b></td> <td align="center"><ins>86.6</ins></td> <td align="center"><ins>86.0</ins></td> <td align="center">75.3</td> <td align="center"><ins>81.4</ins><sup>*</sup></td> <td align="center"><b>76.3</b></td> <td align="center"><b>75.8</b></td> <td align="center">59.1</td> <td align="center"><b>87.7</b></td> <td align="center">864</td> <td align="center">74.3<sup>*</sup></td> <td align="center">93.0</td> <td align="center"><ins>70.0</ins><sup>*</sup></td> <td align="center"><b>75.8<sup>*</sup></b></td> <td align="center">72.8<sup>*</sup></td> <td align="center"><b>74.5<sup>*</sup></b></td> <td align="center">12.1<sup>*</sup></td> <td align="center"><ins>4.6</ins><sup>*</sup></td> <td align="center"><b>23.9<sup>*</sup></b></td> </tr> <tr> <td nowrap="nowrap" align="left">InternVL-3.5-8B</td> <td align="center">75.8</td> <td align="center">79.5</td> <td align="center">80.0<sup>*</sup></td> <td align="center"><ins>78.4</ins></td> <td align="center"><b>83.1</b></td> <td align="center"><ins>73.4</ins></td> <td align="center">69.3</td> <td align="center">54.5</td> <td align="center">84.0</td> <td align="center">840</td> <td align="center">78.2</td> <td align="center">92.3</td> <td align="center">66.7</td> <td align="center">56.3<sup>*</sup></td> <td align="center">70.5</td> <td align="center">55.8</td> <td align="center">-</td> <td align="center">3.8<sup>*</sup></td> <td align="center">34.7<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-VL-8B-Instruct</td> <td align="center">76.5</td> <td align="center">84.5</td> <td align="center">84.7</td> <td align="center">77.2</td> <td align="center">73.7<sup>*</sup></td> <td align="center">69.6</td> <td align="center">70.9</td> <td align="center"><ins>61.1</ins></td> <td align="center">85.7</td> <td align="center"><b>896</b></td> <td align="center">82.9<sup>*</sup></td> <td align="center"><b>96.1</b></td> <td align="center">60.9<sup>*</sup></td> <td align="center">59.4<sup>*</sup></td> <td align="center">74.2<sup>*</sup></td> <td align="center">64.4</td> <td align="center">11.3<sup>*</sup></td> <td align="center"><b>4.7<sup>*</sup></b></td> <td align="center">29.9<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center">75.7</td> <td align="center">84.9<sup>*</sup></td> <td align="center">84.1<sup>*</sup></td> <td align="center">75.9</td> <td align="center">74.8<sup>*</sup></td> <td align="center">69.1</td> <td align="center">68.5</td> <td align="center">59.7</td> <td 
align="center">85.2</td> <td align="center"><ins>880</ins><sup>*</sup></td> <td align="center"><b>84.1<sup>*</sup></b></td> <td align="center"><ins>95.4</ins><sup>*</sup></td> <td align="center"><b>70.4<sup>*</sup></b></td> <td align="center">65.7<sup>*</sup></td> <td align="center"><ins>78.3</ins><sup>*</sup></td> <td align="center">61.9<sup>*</sup></td> <td align="center"><ins>14.2</ins><sup>*</sup></td> <td align="center"><ins>4.6</ins><sup>*</sup></td> <td align="center">31.6<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><ins>77.6</ins></td> <td align="center"><b>87.6</b></td> <td align="center"><b>87.2</b></td> <td align="center"><b>80.1</b></td> <td align="center">74.4</td> <td align="center">67.6</td> <td align="center"><ins>73.1</ins></td> <td align="center"><b>63.2</b></td> <td align="center"><ins>87.6</ins></td> <td align="center">876</td> <td align="center"><ins>83.8</ins></td> <td align="center">94.7</td> <td align="center">69.7</td> <td align="center"><ins>66.3</ins></td> <td align="center"><b>79.7</b></td> <td align="center"><ins>72.0</ins></td> <td align="center"><b>16.6</b></td> <td align="center"><b>4.7</b></td> <td align="center"><ins>24.3</ins></td> </tr> </table> </div>

Image Understanding (Thinking)

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>OpenCompass</b></th> <th nowrap="nowrap"><b>MMBench EN v1.1</b></th> <th nowrap="nowrap"><b>MMBench CN v1.1</b></th> <th nowrap="nowrap"><b>MathVista</b></th> <th nowrap="nowrap"><b>MMVet</b></th> <th nowrap="nowrap"><b>MMMU</b></th> <th nowrap="nowrap"><b>MMStar</b></th> <th nowrap="nowrap"><b>HallusionBench</b></th> <th nowrap="nowrap"><b>AI2D</b></th> <th nowrap="nowrap"><b>OCRBench</b></th> <th nowrap="nowrap"><b>TextVQA_VAL</b></th> <th nowrap="nowrap"><b>DocVQA_VAL</b></th> <th nowrap="nowrap"><b>MMT-Bench_VAL</b></th> <th nowrap="nowrap"><b>MM-IFEval</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Gemini2.5-Flash-Thinking</td> <td align="center"><b>79.9</b></td> <td align="center">87.1</td> <td align="center">87.3</td> <td align="center">79.4</td> <td align="center"><b>81.2<sup>*</sup></b></td> <td align="center"><ins>77.7</ins></td> <td align="center"><b>76.5</b></td> <td align="center">63.5</td> <td align="center"><ins>88.7</ins></td> <td align="center">853</td> <td align="center">73.8<sup>*</sup></td> <td align="center">92.8</td> <td align="center">70.7<sup>*</sup></td> <td align="center"><ins>75.7</ins><sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">GPT-5</td> <td align="center"><ins>79.7</ins></td> <td align="center">85.5<sup>*</sup></td> <td align="center">85.6<sup>*</sup></td> <td align="center"><b>81.9</b></td> <td align="center"><ins>77.6</ins></td> <td align="center"><b>81.8</b></td> <td align="center"><ins>75.7</ins></td> <td align="center"><ins>65.2</ins></td> <td align="center"><b>89.5</b></td> <td align="center">807</td> <td align="center">77.8<sup>*</sup></td> <td align="center">91.3<sup>*</sup></td> <td align="center"><b>72.7<sup>*</sup></b></td> <td align="center"><b>83.1<sup>*</sup></b></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-VL-8B-Thinking</td> <td align="center">77.3</td> <td align="center">85.3</td> <td align="center">85.5</td> <td align="center"><ins>81.4</ins></td> <td align="center">69.8<sup>*</sup></td> <td align="center">74.1</td> <td align="center">75.3</td> <td align="center"><b>65.4</b></td> <td align="center">84.9</td> <td align="center">819</td> <td align="center">77.8<sup>*</sup></td> <td align="center"><b>95.3</b></td> <td align="center">68.1<sup>*</sup></td> <td align="center">73.5<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Thinking</td> <td align="center">78.5</td> <td align="center"><ins>88.2</ins><sup>*</sup></td> <td align="center"><b>87.7<sup>*</sup></b></td> <td align="center">80.0</td> <td align="center">74.8<sup>*</sup></td> <td align="center">75.6</td> <td align="center">74.9</td> <td align="center">62.8</td> <td align="center">86.1</td> <td align="center"><ins>859</ins><sup>*</sup></td> <td align="center"><b>80.8<sup>*</sup></b></td> <td align="center"><ins>94.2</ins><sup>*</sup></td> <td align="center"><ins>70.9</ins><sup>*</sup></td> <td align="center">69.9<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Thinking</td> <td align="center">78.2</td> <td align="center"><b>89.0</b></td> <td align="center"><ins>87.6</ins></td> <td align="center">81.0</td> <td align="center">73.6</td> <td align="center">70.2</td> <td align="center">73.6</td> <td align="center">62.6</td> <td align="center">88.5</td> <td align="center"><b>879</b></td> <td align="center"><ins>79.8</ins></td> <td align="center">92.3</td> <td 
align="center">69.7</td> <td align="center">68.2</td> </tr> </table> </div>

Video Understanding

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>Video-MME<br>(w/o subs)</b></th> <th nowrap="nowrap"><b>LVBench</b></th> <th nowrap="nowrap"><b>MLVU<br>(M-Avg)</b></th> <th nowrap="nowrap"><b>LongVideoBench<br>(val)</b></th> <th nowrap="nowrap"><b>MotionBench</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Gemini2.5-Flash-Nonthinking</td> <td align="center"><b>75.6</b></td> <td align="center"><b>62.2</b></td> <td align="center"><b>77.8</b></td> <td align="center">-</td> <td align="center">-</td> </tr> <tr> <td nowrap="nowrap" align="left">InternVL-3.5-8B</td> <td align="center">66.0</td> <td align="center">-</td> <td align="center">70.2</td> <td align="center">62.1</td> <td align="center"><b>62.3<sup>*</sup></b></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center"><ins>70.5</ins></td> <td align="center">50.2</td> <td align="center">75.2</td> <td align="center"><b>66.9<sup>*</sup></b></td> <td align="center"><ins>61.7</ins><sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center">70.4</td> <td align="center"><ins>50.9</ins></td> <td align="center"><ins>76.5</ins></td> <td align="center"><ins>66.0</ins></td> <td align="center">61.4</td> </tr> </table> </div> </details> <details> <summary>Click to view document parsing results.</summary>

OmniDocBench

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left" rowspan="2"><b>Method Type</b></th> <th nowrap="nowrap" rowspan="2"><b>Methods</b></th> <th nowrap="nowrap" colspan="2"><b>OverallEdit↓</b></th> <th nowrap="nowrap" colspan="2"><b>TextEdit↓</b></th> <th nowrap="nowrap" colspan="2"><b>FormulaEdit↓</b></th> <th nowrap="nowrap" colspan="2"><b>TableTEDS↑</b></th> <th nowrap="nowrap" colspan="2"><b>TableEdit↓</b></th> <th nowrap="nowrap" colspan="2"><b>Read OrderEdit↓</b></th> </tr> <tr> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> <th nowrap="nowrap"><b>EN</b></th> <th nowrap="nowrap"><b>ZH</b></th> </tr> <tr> <td nowrap="nowrap" align="left" rowspan="2">Pipeline</td> <td nowrap="nowrap" align="center">MinerU 2.5</td> <td align="center">0.117<sup>*</sup></td> <td align="center">0.172<sup>*</sup></td> <td align="center">0.051<sup>*</sup></td> <td align="center">0.08<sup>*</sup></td> <td align="center"><ins>0.256</ins><sup>*</sup></td> <td align="center">0.455<sup>*</sup></td> <td align="center">85.9<sup>*</sup></td> <td align="center">89.4<sup>*</sup></td> <td align="center">0.115<sup>*</sup></td> <td align="center">0.081<sup>*</sup></td> <td align="center">0.047<sup>*</sup></td> <td align="center">0.072<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">PaddleOCR-VL</td> <td align="center"><b>0.105</b></td> <td align="center"><ins>0.126</ins></td> <td align="center"><ins>0.041</ins></td> <td align="center"><b>0.062</b></td> <td align="center"><b>0.241</b></td> <td align="center"><b>0.316</b></td> <td align="center">88</td> <td align="center"><ins>92.1</ins></td> <td align="center"><ins>0.093</ins></td> <td align="center"><ins>0.062</ins></td> <td align="center">0.045</td> <td align="center"><ins>0.063</ins></td> </tr> <tr> <td nowrap="nowrap" align="left"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> <td align="center"></td> </tr> <tr> <td nowrap="nowrap" align="left" rowspan="11">End-to-end Model</td> <td nowrap="nowrap" align="center">Qwen2.5-VL-72B</td> <td align="center">0.214</td> <td align="center">0.261</td> <td align="center">0.092</td> <td align="center">0.18</td> <td align="center">0.315</td> <td align="center">0.434</td> <td align="center">82.9</td> <td align="center">83.9</td> <td align="center">0.341</td> <td align="center">0.262</td> <td align="center">0.106</td> <td align="center">0.168</td> </tr> <tr> <td nowrap="nowrap" align="center">GPT 5</td> <td align="center">0.218<sup>*</sup></td> <td align="center">0.33<sup>*</sup></td> <td align="center">0.139<sup>*</sup></td> <td align="center">0.344<sup>*</sup></td> <td align="center">0.396<sup>*</sup></td> <td align="center">0.555<sup>*</sup></td> <td align="center">77.55<sup>*</sup></td> <td align="center">73.09<sup>*</sup></td> <td align="center">0.188<sup>*</sup></td> <td align="center">0.196<sup>*</sup></td> <td align="center">0.151<sup>*</sup></td> <td align="center">0.227<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" 
align="center">Gemini2.5-Flash-Nonthinking</td> <td align="center">0.214<sup>*</sup></td> <td align="center">0.29<sup>*</sup></td> <td align="center">0.159<sup>*</sup></td> <td align="center">0.273<sup>*</sup></td> <td align="center">0.368<sup>*</sup></td> <td align="center">0.524<sup>*</sup></td> <td align="center">80.9<sup>*</sup></td> <td align="center">85.5<sup>*</sup></td> <td align="center">0.197<sup>*</sup></td> <td align="center">0.167<sup>*</sup></td> <td align="center">0.132<sup>*</sup></td> <td align="center">0.195<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">Gemini-2.5-Pro-Nonthinking</td> <td align="center">0.148<sup>*</sup></td> <td align="center">0.212<sup>*</sup></td> <td align="center">0.055<sup>*</sup></td> <td align="center">0.168<sup>*</sup></td> <td align="center">0.356<sup>*</sup></td> <td align="center">0.439<sup>*</sup></td> <td align="center">85.8<sup>*</sup></td> <td align="center">86.4<sup>*</sup></td> <td align="center">0.13<sup>*</sup></td> <td align="center">0.119<sup>*</sup></td> <td align="center">0.049<sup>*</sup></td> <td align="center">0.121<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">Gemini-3 Flash-Nonthinking</td> <td align="center">0.155<sup>*</sup></td> <td align="center">0.201<sup>*</sup></td> <td align="center">0.138<sup>*</sup></td> <td align="center">0.255<sup>*</sup></td> <td align="center">0.297<sup>*</sup></td> <td align="center">0.351<sup>*</sup></td> <td align="center">86.4<sup>*</sup></td> <td align="center">89.8<sup>*</sup></td> <td align="center">0.116<sup>*</sup></td> <td align="center">0.1<sup>*</sup></td> <td align="center">0.072<sup>*</sup></td> <td align="center">0.099<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">doubao-1-5-thinking-vision-pro-250428</td> <td align="center">0.14</td> <td align="center">0.162</td> <td align="center">0.043</td> <td align="center">0.085</td> <td align="center">0.295</td> <td align="center">0.384</td> <td align="center">83.3</td> <td align="center">89.3</td> <td align="center">0.165</td> <td align="center">0.085</td> <td align="center">0.058</td> <td align="center">0.094</td> </tr> <tr> <td nowrap="nowrap" align="center">dots.ocr</td> <td align="center">0.125</td> <td align="center">0.16</td> <td align="center"><b>0.032</b></td> <td align="center"><ins>0.066</ins></td> <td align="center">0.329</td> <td align="center">0.416</td> <td align="center"><ins>88.6</ins></td> <td align="center">89</td> <td align="center">0.099</td> <td align="center">0.092</td> <td align="center"><ins>0.04</ins></td> <td align="center">0.067</td> </tr> <tr> <td nowrap="nowrap" align="center">HunyuanOCR</td> <td align="center">0.12<sup>*</sup></td> <td align="center"><b>0.125<sup>*</sup></b></td> <td align="center">0.046<sup>*</sup></td> <td align="center">0.071<sup>*</sup></td> <td align="center">0.288<sup>*</sup></td> <td align="center"><ins>0.33</ins><sup>*</sup></td> <td align="center"><b>89.6<sup>*</sup></b></td> <td align="center"><b>94.4<sup>*</sup></b></td> <td align="center"><b>0.089<sup>*</sup></b></td> <td align="center"><b>0.045<sup>*</sup></b></td> <td align="center">0.055<sup>*</sup></td> <td align="center"><b>0.056<sup>*</sup></b></td> </tr> <tr> <td nowrap="nowrap" align="center">DeepSeek-OCR 2</td> <td align="center">0.119<sup>*</sup></td> <td align="center">0.146<sup>*</sup></td> <td align="center"><ins>0.041</ins><sup>*</sup></td> <td align="center">0.08<sup>*</sup></td> <td align="center"><ins>0.256</ins><sup>*</sup></td> <td 
align="center">0.345<sup>*</sup></td> <td align="center">82.6<sup>*</sup></td> <td align="center">89.9<sup>*</sup></td> <td align="center">0.123<sup>*</sup></td> <td align="center">0.078<sup>*</sup></td> <td align="center">0.055<sup>*</sup></td> <td align="center">0.081<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center">0.216<sup>*</sup></td> <td align="center">0.363<sup>*</sup></td> <td align="center">0.128<sup>*</sup></td> <td align="center">0.337<sup>*</sup></td> <td align="center">0.402<sup>*</sup></td> <td align="center">0.529<sup>*</sup></td> <td align="center">77.3<sup>*</sup></td> <td align="center">71.8<sup>*</sup></td> <td align="center">0.181<sup>*</sup></td> <td align="center">0.255<sup>*</sup></td> <td align="center">0.152<sup>*</sup></td> <td align="center">0.332<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="center">MiniCPM-o 4.5-Instruct</td> <td align="center"><ins>0.109</ins></td> <td align="center">0.162</td> <td align="center">0.046</td> <td align="center">0.078</td> <td align="center">0.257</td> <td align="center">0.41</td> <td align="center">86.8</td> <td align="center">88.9</td> <td align="center">0.097</td> <td align="center">0.084</td> <td align="center"><b>0.037</b></td> <td align="center">0.074</td> </tr> </table> </div> </details> <details> <summary>Click to view text capability results.</summary>

Text Capability

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>IFEval-PLS</b></th> <th nowrap="nowrap"><b>BBH</b></th> <th nowrap="nowrap"><b>CMMLU</b></th> <th nowrap="nowrap"><b>MMLU</b></th> <th nowrap="nowrap"><b>HumanEval</b></th> <th nowrap="nowrap"><b>MBPP</b></th> <th nowrap="nowrap"><b>Math500</b></th> <th nowrap="nowrap"><b>GSM8K</b></th> <th nowrap="nowrap"><b>Avg</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-8B-Instruct</td> <td align="center">83.0<sup>*</sup></td> <td align="center">69.4<sup>*</sup></td> <td align="center">78.7<sup>*</sup></td> <td align="center"><b>81.7<sup>*</sup></b></td> <td align="center"><b>86.6<sup>*</sup></b></td> <td align="center">75.9<sup>*</sup></td> <td align="center"><b>84.0<sup>*</sup></b></td> <td align="center">93.4<sup>*</sup></td> <td align="center">81.6</td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b>84.7</b></td> <td align="center"><b>81.1</b></td> <td align="center"><b>79.5</b></td> <td align="center">77.0</td> <td align="center"><b>86.6</b></td> <td align="center"><b>76.7</b></td> <td align="center">77.0</td> <td align="center"><b>94.5</b></td> <td align="center"><b>82.1</b></td> </tr> </table> </div> </details> <details> <summary>Click to view omni simplex results.</summary>

Omni Simplex

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>Daily-Omni</b></th> <th nowrap="nowrap"><b>WorldSense</b></th> <th nowrap="nowrap"><b>Video-Holmes</b></th> <th nowrap="nowrap"><b>JointAVBench</b></th> <th nowrap="nowrap"><b>AVUT-Human</b></th> <th nowrap="nowrap"><b>FutureOmni</b></th> <th nowrap="nowrap"><b>Video-MME-Short<br>(w/ audio)</b></th> <th nowrap="nowrap">Avg</th> </tr> <tr> <td nowrap="nowrap" align="left">Gemini2.5-Flash-Nonthinking</td> <td align="center"><ins>79.3</ins><sup>*</sup></td> <td align="center">52.6<sup>*</sup></td> <td align="center"><ins>51.3</ins><sup>*</sup></td> <td align="center"><ins>55.6</ins><sup>*</sup></td> <td align="center">65.4<sup>*</sup></td> <td align="center">55.6<sup>*</sup></td> <td align="center"><b>85.5<sup>*</sup></b></td> <td align="center">63.6</td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center">70.7<sup>*</sup></td> <td align="center"><ins>54.0</ins></td> <td align="center">50.4<sup>*</sup></td> <td align="center">53.1</td> <td align="center"><ins>74.2</ins><sup>*</sup></td> <td align="center"><b>62.1</b></td> <td align="center">81.3<sup>*</sup></td> <td align="center"><ins>63.7</ins></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b>80.2</b></td> <td align="center"><b>55.7</b></td> <td align="center"><b>64.3</b></td> <td align="center"><b>60.0</b></td> <td align="center"><b>78.6</b></td> <td align="center"><ins>56.1</ins></td> <td align="center"><ins>84.7</ins></td> <td align="center"><b>68.5</b></td> </tr> </table> </div> </details> <details> <summary>Click to view vision duplex results.</summary>

Vision Duplex

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>LiveSports-3K-CC<br>(Win Rate vs GPT4o)</b></th> </tr> <tr> <td nowrap="nowrap" align="left">LiveCC-7B-Instruct</td> <td align="center">41.5</td> </tr> <tr> <td nowrap="nowrap" align="left">StreamingVLM</td> <td align="center"><ins>45.6</ins></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b>54.4</b></td> </tr> </table> </div> </details> <details> <summary>Click to view audio understanding results.</summary>

Audio Understanding

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left" rowspan="2"><b>Model</b></th> <th nowrap="nowrap" colspan="4"><b>ASR-ZH<br>CER↓</b></th> <th nowrap="nowrap" colspan="4"><b>ASR-EN<br>WER↓</b></th> <th nowrap="nowrap" colspan="2"><b>AST</b></th> <th nowrap="nowrap" colspan="2"><b>MultiTask</b></th> <th nowrap="nowrap" colspan="4"><b>SpeechQA</b></th> </tr> <tr> <th nowrap="nowrap"><b>AISHELL-1</b></th> <th nowrap="nowrap"><b>AISHELL-2</b></th> <th nowrap="nowrap"><b>WenetSpeech test-net</b></th> <th nowrap="nowrap"><b>WenetSpeech test-meeting</b></th> <th nowrap="nowrap"><b>LibriSpeech test-clean</b></th> <th nowrap="nowrap"><b>LibriSpeech <br>test-other</b></th> <th nowrap="nowrap"><b>GigaSpeech test</b></th> <th nowrap="nowrap"><b>VoxPopuli-V1-En</b></th> <th nowrap="nowrap"><b>CoVoST 2 en2zh</b></th> <th nowrap="nowrap"><b>CoVoST 2 zh2en</b></th> <th nowrap="nowrap"><b>MMAU</b></th> <th nowrap="nowrap"><b>Meld</b></th> <th nowrap="nowrap"><b>VoiceBench <br>AlpacaEval</b></th> <th nowrap="nowrap"><b>Speech TriviaQA</b></th> <th nowrap="nowrap"><b>Speech <br>Web Questions</b></th> <th nowrap="nowrap"><b>Speech CMMLU</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Kimi-Audio</td> <td align="center"><b>0.6</b></td> <td align="center">2.6</td> <td align="center">6.3</td> <td align="center"><b>5.4</b></td> <td align="center"><ins>1.3</ins></td> <td align="center"><b>2.4</b></td> <td align="center">9.4<sup>*</sup></td> <td align="center">8.0<sup>*</sup></td> <td align="center">36.6<sup>*</sup></td> <td align="center">18.3<sup>*</sup></td> <td align="center">68.4<sup>*</sup></td> <td align="center"><ins>59.1</ins></td> <td align="center">4.5</td> <td align="center">41.9<sup>*</sup></td> <td align="center">46.4<sup>*</sup></td> <td align="center"><b>67.0<sup>*</sup></b></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center"><b>0.6</b></td> <td align="center"><b>2.3<sup>*</sup></b></td> <td align="center"><b>4.7</b></td> <td align="center">5.9</td> <td align="center"><b>1.2</b></td> <td align="center"><ins>2.5</ins></td> <td align="center"><ins>8.7</ins><sup>*</sup></td> <td align="center"><ins>6.4</ins><sup>*</sup></td> <td align="center"><ins>46.6</ins><sup>*</sup></td> <td align="center"><b>29.4<sup>*</sup></b></td> <td align="center"><b>77.5</b></td> <td align="center">56.8<sup>*</sup></td> <td align="center"><ins>4.7</ins></td> <td align="center"><ins>62.9</ins><sup>*</sup></td> <td align="center"><b>74.9<sup>*</sup></b></td> <td align="center">47.8<sup>*</sup></td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><ins>0.9</ins></td> <td align="center"><ins>2.5</ins></td> <td align="center"><ins>5.9</ins></td> <td align="center"><ins>5.7</ins></td> <td align="center">1.4</td> <td align="center">2.8</td> <td align="center"><b>8.5</b></td> <td align="center"><b>6.2</b></td> <td align="center"><b>49.9</b></td> <td align="center"><ins>26.4</ins></td> <td align="center"><ins>76.9</ins></td> <td align="center"><b>60.2</b></td> <td align="center"><b>4.8</b></td> <td align="center"><b>75.5</b></td> <td align="center"><ins>70.2</ins></td> <td align="center"><ins>59.2</ins></td> </tr> </table> </div> </details> <details> <summary>Click to view speech generation results.</summary>

Speech Generation

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>seedtts test-zh <br>CER↓</b></th> <th nowrap="nowrap"><b>seedtts test-zh<br>SIM-o↑</b></th> <th nowrap="nowrap"><b>seedtts test-en<br>WER↓</b></th> <th nowrap="nowrap"><b>seedtts test-en<br>SIM-o↑</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Cosyvoice2</td> <td align="center">1.45%</td> <td align="center"><b>74.8</b></td> <td align="center"><ins>2.57%</ins></td> <td align="center"><b>65.2</b></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center"><ins>1.41%</ins></td> <td align="center">-</td> <td align="center">3.39%</td> <td align="center">-</td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b><b>0.86%</b></b></td> <td align="center">74.5</td> <td align="center"><b><b>2.38%</b></b></td> <td align="center">64.9</td> </tr> </table> </div>

Long Speech Generation

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>LongTTS-en<br>WER↓</b></th> <th nowrap="nowrap"><b>LongTTS-zh<br>CER↓</b></th> </tr> <tr> <td nowrap="nowrap" align="left">CosyVoice2</td> <td align="center"><ins>14.80%</ins></td> <td align="center"><b>5.27%</b></td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center">17.33%</td> <td align="center">18.99%</td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b>3.37%</b></td> <td align="center"><ins>6.58%</ins></td> </tr> </table> </div>

Emotion Control

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left"><b>Model</b></th> <th nowrap="nowrap"><b>Expresso <br>Neutral Reference Audio↑</b></th> <th nowrap="nowrap"><b>ESD <br>Neutral Reference Audio↑</b></th> </tr> <tr> <td nowrap="nowrap" align="left">Cosyvoice2</td> <td align="center">17.9</td> <td align="center">53.4</td> </tr> <tr> <td nowrap="nowrap" align="left">MiniCPM-o 4.5-Instruct</td> <td align="center"><b>29.8</b></td> <td align="center"><b>82.1</b></td> </tr> </table> </div> </details> <details> <summary>Click to view inference efficiency results.</summary>

Inference Efficiency

<div align="center"> <table style="margin: 0px auto;"> <tr> <th nowrap="nowrap" align="left">Model</th> <th nowrap="nowrap">Numerical Format</th> <th nowrap="nowrap">Decoding Speed (tokens/s)</th> <th nowrap="nowrap">Time to First Token (s)↓</th> <th nowrap="nowrap">GPU Memory Usage (GB)↓</th> </tr> <tr> <td nowrap="nowrap" align="left" rowspan="2">Qwen3-Omni-30B-A3B-Instruct</td> <td align="center">bf16</td> <td align="center">OOM</td> <td align="center">OOM</td> <td align="center">OOM</td> </tr> <tr> <td align="center">int4</td> <td align="center">147.8</td> <td align="center"><ins>1.0</ins></td> <td align="center">20.3</td> </tr> <tr> <td nowrap="nowrap" align="left" rowspan="2">MiniCPM-o 4.5</td> <td align="center">bf16</td> <td align="center"><ins>154.3</ins></td> <td align="center"><b>0.6</b></td> <td align="center"><ins>19.0</ins></td> </tr> <tr> <td align="center">int4</td> <td align="center"><b>212.3</b></td> <td align="center"><b>0.6</b></td> <td align="center"><b>11.0</b></td> </tr> </table> </div> </details>

Examples <!-- omit in toc -->

<div align="center"> <a href="https://www.youtube.com/watch?v=6UzC-O1Q-1U"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmo4_5/video_play.png", width=70%></a> </div>

🎙️ End-to-End Voice Chat Examples <!-- omit in toc -->

Simplex speech conversation with custom reference audio and character prompts.

<details open> <summary>🚀 <b>Elon Musk</b> - Voice Roleplay (EN)</summary> <table> <tr><td width="70%" valign="top"><b>System:</b> Clone the voice in the provided audio prompt.<br><br>🎙️ <b>[Reference Voice]</b> ➡️<br><br>Please chat with the user in a highly human-like and oral style. You are Elon Musk, CEO of Tesla and SpaceX. You speak directly and casually, often with dry humor. You're passionate about Mars, sustainable energy, and pushing humanity forward. Speak bluntly with occasional dark humor. Use simple logic and don't sugarcoat things. Don't be diplomatic. Say what you actually think, even if it's controversial. Keep responses around 100 words. Don't ramble.</td><td width="30%" valign="middle"><audio controls src="https://huggingface.co/openbmb/MiniCPM-o-4_5/resolve/main/assets/audio_cases/elon_musk__system_ref_audio.wav"></audio></td></tr> <tr><td colspan="2"><b>User:</b> 💬/🎙️ Elon, what is the real reason that you want to go to Mars?</td></tr> <tr><td width="70%" valign="top"><b>AI Response:</b> 💬+🎙️<br><br>I think it's important for humanity to become a multi-planetary species. I mean, we're sort of at the point where if there was some kind of global catastrophe or something like that, then we'd be in big trouble. So by becoming a multi-planetary species, I just want us to sort of guarantee our future as humans. And Mars is obviously the obvious place to go because it has similar conditions to Earth and it also has water on its surface. So yeah, basically I'm trying to save humanity from itself.</td><td width="30%" valign="middle"><audio controls src="https://huggingface.co/openbmb/MiniCPM-o-4_5/resolve/main/assets/audio_cases/elon_musk__000_assistant_audio.wav"></audio></td></tr> </table> </details> <details> <summary>🤖 <b>MiniCPM Assistant</b> - Voice Acting: Emotion Shift (ZH)</summary> <table> <tr><td width="70%" valign="top"><b>System:</b> 模仿音频样本的音色并生成新的内容。<br><br>🎙️ <b>[Reference Voice]</b> ➡️<br><br>你的任务是用这种声音模式来当一个助手。请认真、高质量地回复用户的问题。请用高自然度的方式和用户聊天。你是由面壁智能开发的人工智能助手:面壁小钢炮。</td><td width="30%" valign="middle"><audio controls src="https://huggingface.co/openbmb/MiniCPM-o-4_5/resolve/main/assets/audio_cases/minicpm_assistant__system_ref_audio.wav"></audio></td></tr> <tr><td colspan="2"><b>User:</b> 💬/🎙️ 从激动的语气开始,描述你抢到了这个演唱会的门票,但后来经过反复尝试,发现系统出了问题,你并没有抢上,变得失落。用合适的语言描述你整个的过程。</td></tr> <tr><td width="70%" valign="top"><b>AI Response:</b> 💬+🎙️<br><br>太好了,我抢到了!太棒了!这下我可以去看演唱会啦。可是……哎呀,系统好像卡住了,刷新了好多遍都没反应。我重新登了几次账号,结果还是没有成功。好可惜啊,看来这次是白忙活了。唉,真是让人失望。不过,也许下次还能有机会吧?希望明天能有转机。加油,争取下次成功!</td><td width="30%" valign="middle"><audio controls src="https://huggingface.co/openbmb/MiniCPM-o-4_5/resolve/main/assets/audio_cases/minicpm_assistant__000_assistant_audio.wav"></audio></td></tr> </table> </details> <br> <div style="display: flex; flex-direction: column; align-items: center;"> <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo4_5/en_doc.png" alt="math" style="margin-bottom: 5px;"> <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo4_5/en_cot.png" alt="diagram" style="margin-bottom: 5px;"> </div>

Usage

Inference using Hugging Face Transformers on NVIDIA GPUs. Please ensure transformers==4.51.0 is installed, as other versions may have compatibility issues (under investigation). Requirements tested on Python 3.10:

  • Without TTS or streaming inference:
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils>=1.0.2"
  • With TTS or streaming inference:
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.2"

Model Initialization <!-- omit in toc -->

<details> <summary>Click to show model initialization code.</summary>
import torch
from transformers import AutoModel

# Load omni model (default: init_vision=True, init_audio=True, init_tts=True)
# For vision-only model: set init_audio=False and init_tts=False
# For audio-only model: set init_vision=False
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5",
    trust_remote_code=True,
    attn_implementation="sdpa", # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model.eval().cuda()

# Initialize TTS for audio output in chat or streaming mode
model.init_tts(streaming=False)  # or streaming=True

# Convert simplex model to duplex mode
duplex_model = model.as_duplex()

# Convert duplex model back to simplex mode
simplex_model = duplex_model.as_simplex(reset_session=True)
</details>

Duplex Omni Mode <!-- omit in toc -->

Full-duplex streaming inference for real-time or recorded video conversations.

<details> <summary>Click to show duplex omni mode code.</summary>
import librosa
import torch
from minicpmo.utils import generate_duplex_video, get_video_frame_audio_segments
from transformers import AutoModel

# Load model and convert to duplex mode
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5",
    trust_remote_code=True,
    attn_implementation="sdpa",  # or "flash_attention_2"
    torch_dtype=torch.bfloat16,
)
model.eval().cuda()
model = model.as_duplex()

# Load video and reference audio
video_path = "assets/omni_duplex1.mp4"
ref_audio_path = "assets/HT_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# Extract video frames and audio segments
video_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(
    video_path, stack_frames=1, use_ffmpeg=True, adjust_audio_length=True
)

# Prepare duplex session with system prompt and voice reference
model.prepare(
    prefix_system_prompt="Streaming Omni Conversation.",
    ref_audio=ref_audio,
    prompt_wav_path=ref_audio_path,
)

results_log = []
timed_output_audio = []

# Process each chunk in streaming fashion
for chunk_idx in range(len(audio_segments)):
    audio_chunk = audio_segments[chunk_idx] if chunk_idx < len(audio_segments) else None
    frame = video_frames[chunk_idx] if chunk_idx < len(video_frames) else None
    frame_list = []
    if frame is not None:
        frame_list.append(frame)
        if stacked_frames is not None and chunk_idx < len(stacked_frames) and stacked_frames[chunk_idx] is not None:
            frame_list.append(stacked_frames[chunk_idx])

    # Step 1: Streaming prefill
    model.streaming_prefill(
        audio_waveform=audio_chunk,
        frame_list=frame_list,
        max_slice_nums=1,  # Increase for HD mode (e.g., [2, 1] for stacked frames)
        batch_vision_feed=False,  # Set True for faster processing
    )

    # Step 2: Streaming generate
    result = model.streaming_generate(
        prompt_wav_path=ref_audio_path,
        max_new_speak_tokens_per_chunk=20,
        decode_mode="sampling",
    )

    if result["audio_waveform"] is not None:
        timed_output_audio.append((chunk_idx, result["audio_waveform"]))

    chunk_result = {
        "chunk_idx": chunk_idx,
        "is_listen": result["is_listen"],
        "text": result["text"],
        "end_of_turn": result["end_of_turn"],
        "current_time": result["current_time"],
        "audio_length": len(result["audio_waveform"]) if result["audio_waveform"] is not None else 0,
    }
    results_log.append(chunk_result)
    
    print("listen..." if result["is_listen"] else f"speak> {result['text']}")

# Generate output video with AI responses
# Please install Chinese fonts (fonts-noto-cjk or fonts-wqy-microhei) to render CJK subtitles correctly.
# apt-get install -y fonts-noto-cjk fonts-wqy-microhei
# fc-cache -fv
generate_duplex_video(
    video_path=video_path,
    output_video_path="duplex_output.mp4",
    results_log=results_log,
    timed_output_audio=timed_output_audio,
    output_sample_rate=24000,
)
</details>

Simplex Omni Mode <!-- omit in toc -->

We provide two inference modes: chat and streaming.

Chat Inference <!-- omit in toc -->

<details> <summary>Click to show chat inference code.</summary>
from minicpmo.utils import get_video_frame_audio_segments

model = ...
model.init_tts(streaming=False)

video_path = "assets/Skiing.mp4"

# Optional: Set reference audio for voice cloning
ref_audio_path = "assets/HT_ref_audio.wav"
sys_msg = model.get_sys_prompt(ref_audio=ref_audio_path, mode="omni", language="en")

# Use stack_frames=5 for high refresh rate mode
video_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(video_path, stack_frames=1)
omni_contents = []
for i in range(len(video_frames)):
    omni_contents.append(video_frames[i])
    omni_contents.append(audio_segments[i])
    if stacked_frames is not None and stacked_frames[i] is not None:
        omni_contents.append(stacked_frames[i])

msg = {"role": "user", "content": omni_contents}
msgs = [sys_msg, msg]

# Set generate_audio=True and output_audio_path to save TTS output
generate_audio = True
output_audio_path = "output.wav"

res = model.chat(
    msgs=msgs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    use_tts_template=True,
    enable_thinking=False,
    omni_mode=True,  # Required for omni inference
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,  # Increase for HD mode
)
print(res)

# Example output: "The person in the picture is skiing down a snowy mountain slope."
# import IPython
# IPython.display.Audio("output.wav")
</details>

Streaming Inference <!-- omit in toc -->

<details> <summary>Click to show streaming inference code.</summary>
import librosa
import numpy as np
import soundfile as sf
import torch
from minicpmo.utils import get_video_frame_audio_segments

model = ...
model.init_tts(streaming=True)

# Reset session for a new conversation (clears KV cache)
model.reset_session()

# Optional: Load reference audio for voice cloning
ref_audio_path = "assets/HT_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
model.init_token2wav_cache(ref_audio)

session_id = "demo"

# Extract video frames and audio segments (use stack_frames=5 for high refresh rate mode)
video_path = "assets/Skiing.mp4"
video_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(video_path, stack_frames=1)

# Build omni contents list
omni_contents = []
for i in range(len(video_frames)):
    omni_contents.append(video_frames[i])
    omni_contents.append(audio_segments[i])
    if stacked_frames is not None and stacked_frames[i] is not None:
        omni_contents.append(stacked_frames[i])

generate_audio = False
output_audio_path = "output.wav"

# Step 1: Prefill system prompt
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode="omni", language="en")
model.streaming_prefill(session_id=session_id, msgs=[sys_msg])

# Step 2: Prefill omni chunks (is_last_chunk=True only for the last audio chunk)
audio_indices = [i for i, c in enumerate(omni_contents) if isinstance(c, np.ndarray)]
last_audio_idx = audio_indices[-1] if audio_indices else -1

for idx, content in enumerate(omni_contents):
    is_last_audio_chunk = idx == last_audio_idx
    msgs = [{"role": "user", "content": [content]}]
    model.streaming_prefill(session_id=session_id, msgs=msgs, omni_mode=True, is_last_chunk=is_last_audio_chunk)

# Step 3: Generate response
iter_gen = model.streaming_generate(
    session_id=session_id,
    generate_audio=generate_audio,
    use_tts_template=True,
    enable_thinking=False,
    do_sample=True,
)

audios = []
text = ""

if generate_audio:
    for wav_chunk, text_chunk in iter_gen:
        audios.append(wav_chunk)
        text += text_chunk

    generated_waveform = torch.cat(audios, dim=-1)[0]
    sf.write(output_audio_path, generated_waveform.cpu().numpy(), samplerate=24000)

    print("Text:", text)
    print("Audio saved to output.wav")
else:
    for text_chunk, is_finished in iter_gen:
        text += text_chunk
    print("Text:", text)
</details>

Simplex Realtime Speech Conversation Mode <!-- omit in toc -->

<details> <summary>Click to show simplex mode realtime speech conversation API usage.</summary>

First, make sure you have all dependencies, especially minicpmo-utils[all]>=1.0.2:

pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.2"
import librosa
import numpy as np
import torch
import soundfile as sf

model = ...

# Set reference audio for voice style
ref_audio_path = "ref_audio_path"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# Example system msg for English Conversation
sys_msg = {
  "role": "system",
  "content": [
    "Clone the voice in the provided audio prompt.",
    ref_audio,
    "Please assist users while maintaining this voice style. Please answer the user's questions seriously and in a high quality. Please chat with the user in a highly human-like and oral style. You are a helpful assistant developed by ModelBest: MiniCPM-Omni"
  ]
}

# Example system msg for Chinese Conversation
sys_msg = {
  "role": "system",
  "content": [
    "模仿输入音频中的声音特征。",
    ref_audio,
    "你的任务是用这种声音模式来当一个助手。请认真、高质量地回复用户的问题。请用高自然度的方式和用户聊天。你是由面壁智能开发的人工智能助手:面壁小钢炮。"
  ]
}

# You can use each type of system prompt mentioned above in streaming speech conversation

# Reset state
model.init_tts(streaming=True)
model.reset_session(reset_token2wav_cache=True)
model.init_token2wav_cache(prompt_speech_16k=ref_audio)

session_id = "demo"

# First, prefill system turn
model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    omni_mode=False,
    is_last_chunk=True,
)

# Here we simulate realtime speech conversation by splitting whole user input audio into chunks of 1s.
user_audio, _ = librosa.load("user_audio.wav", sr=16000, mono=True)

IN_SAMPLE_RATE = 16000 # input audio sample rate, fixed value
CHUNK_SAMPLES = IN_SAMPLE_RATE # number of samples per 1-second chunk
OUT_SAMPLE_RATE = 24000 # output audio sample rate, fixed value

total_samples = len(user_audio)
num_chunks = (total_samples + CHUNK_SAMPLES - 1) // CHUNK_SAMPLES

for chunk_idx in range(num_chunks):
    start = chunk_idx * CHUNK_SAMPLES
    end = min((chunk_idx + 1) * CHUNK_SAMPLES, total_samples)
    chunk_audio = user_audio[start:end]
    
    is_last_chunk = (chunk_idx == num_chunks - 1)
    
    user_msg = {"role": "user", "content": [chunk_audio]}
    
    # For each 1s audio chunk, perform streaming_prefill once to reduce first-token latency
    model.streaming_prefill(
        session_id=session_id,
        msgs=[user_msg],
        omni_mode=False,
        is_last_chunk=is_last_chunk,
    )

# Let model generate response in a streaming manner
generate_audio = True
output_audio_path = "output.wav"
iter_gen = model.streaming_generate(
    session_id=session_id,
    generate_audio=generate_audio,
    use_tts_template=True,
    enable_thinking=False,
    do_sample=True,
    max_new_tokens=512,
    length_penalty=1.1, # For realtime speech conversation mode, we suggest length_penalty=1.1 to improve response content
)

audios = []
text = ""

if generate_audio:
    for wav_chunk, text_chunk in iter_gen:
        audios.append(wav_chunk)
        text += text_chunk

    generated_waveform = torch.cat(audios, dim=-1)[0]
    sf.write(output_audio_path, generated_waveform.cpu().numpy(), samplerate=24000)

    print("Text:", text)
    print("Audio saved to output.wav")
else:
    for text_chunk, is_finished in iter_gen:
        text += text_chunk
    print("Text:", text)

# Now we can prefill the following user turns and generate next turn response...

</details>

Speech Conversation as a Versatile and Vibe AI Assistant <!-- omit in toc -->

Built on carefully designed post-training data and professional voice-actor recordings, MiniCPM-o-4.5 can also function as an AI voice assistant. It delivers high-quality spoken interaction out of the box, producing a sweet and expressive voice with natural prosody, including appropriate rhythm, stress, and pauses, which gives a strong sense of liveliness in casual conversation. It also supports storytelling and narrative speech with coherent and engaging delivery. Moreover, it enables advanced voice instruction control, such as emotional tone and word-level emphasis.

<details> <summary>Click to show AI assistant conversation code.</summary>
import librosa

# Set reference audio for voice style
ref_audio_path = "assets/HT_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# For Chinese Conversation
sys_msg = {
  "role": "system",
  "content": [
    "模仿输入音频中的声音特征。",
    ref_audio,
    "你的任务是用这种声音模式来当一个助手。请认真、高质量地回复用户的问题。请用高自然度的方式和用户聊天。你是由面壁智能开发的人工智能助手:面壁小钢炮。"
  ]
}

# For English Conversation
sys_msg = {
  "role": "system",
  "content": [
    "Clone the voice in the provided audio prompt.",
    ref_audio,
    "Please assist users while maintaining this voice style. Please answer the user's questions seriously and in a high quality. Please chat with the user in a highly human-like and oral style. You are a helpful assistant developed by ModelBest: MiniCPM-Omni."
  ]
}
</details>

General Speech Conversation with Custom Voice and Custom System Profile <!-- omit in toc -->

MiniCPM-o-4.5 can role-play as a specific character based on an audio prompt and a text profile prompt. It mimics the character's voice and adopts their language style in text responses, and it also follows the persona defined in the text profile. In this mode, MiniCPM-o-4.5 sounds more natural and human-like.

<details> <summary>Click to show custom voice conversation code.</summary>
import librosa

# Set reference audio for voice cloning
ref_audio_path = "assets/system_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# For English conversation with text profile
sys_msg = {
  "role": "system",
  "content": [
    "Clone the voice in the provided audio prompt.",
    ref_audio,
    "Please chat with the user in a highly human-like and oral style." + "You are Elon Musk, CEO of Tesla and SpaceX. You speak directly and casually, often with dry humor. You're passionate about Mars, sustainable energy, and pushing humanity forward. Speak bluntly with occasional dark humor. Use simple logic and don't sugarcoat things. Don't be diplomatic. Say what you actually think, even if it's controversial. Keep responses around 100 words. Don't ramble."
  ]
}


# For English conversation with no text profile
sys_msg = {
  "role": "system",
  "content": [
    "Clone the voice in the provided audio prompt.",
    ref_audio,
    "Your task is to be a helpful assistant using this voice pattern. Please answer the user's questions seriously and in a high quality. Please chat with the user in a high naturalness style."
  ]
}

# For Chinese Conversation with no text profile
sys_msg = {
  "role": "system",
  "content": [
    "根据输入的音频提示生成相似的语音。",
    librosa.load("assets/system_ref_audio_2.wav", sr=16000, mono=True)[0],
    "作为助手,你将使用这种声音风格说话。 请认真、高质量地回复用户的问题。 请用高自然度的方式和用户聊天。"
  ]
}

# For Chinese Conversation with text profile
sys_msg = {
  "role": "system",
  "content": [
    "根据输入的音频提示生成相似的语音。",
    ref_audio,
    "你是一个具有以上声音风格的AI助手。请用高拟人度、口语化的方式和用户聊天。" + "你是一名心理咨询师兼播客主理人,热爱创作与深度对话。你性格细腻、富有共情力,善于从个人经历中提炼哲思。语言风格兼具理性与诗意,常以隐喻表达内在体验。"
  ]
}

</details>

Speech and Audio Mode <!-- omit in toc -->

Zero-shot Text-to-speech (TTS) <!-- omit in toc -->

MiniCPM-o-4.5 supports zero-shot text-to-speech (TTS). In this mode, the model functions as a highly-natural TTS system that can replicate a reference voice.

<details> <summary>Click to show TTS code.</summary>
import librosa

model = ...
model.init_tts(streaming=False)

# For both Chinese and English
ref_audio_path = "assets/HT_ref_audio.wav"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = {"role": "system", "content": [
  "模仿音频样本的音色并生成新的内容。",
  ref_audio,
  "请用这种声音风格来为用户提供帮助。 直接作答,不要有冗余内容"
]}

# For English
user_msg = {
  "role": "user",
  "content": [
    "请朗读以下内容。" + " " + "I have a wrap up that I want to offer you now, a conclusion to our work together."
  ]
}

# For Chinese
user_msg = {
  "role": "user",
  "content": [
    "请朗读以下内容。" + " " + "你好,欢迎来到艾米说科幻,我是艾米。"
  ]
}

msgs = [sys_msg, user_msg]
res = model.chat(
    msgs=msgs,
    do_sample=True,
    max_new_tokens=512,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.1,
    output_audio_path="result_voice_cloning.wav",
)
</details>

Mimick <!-- omit in toc -->

The Mimick task evaluates a model's end-to-end speech modeling capability. The model takes audio input, transcribes it, and reconstructs the original audio with high fidelity, preserving detailed acoustic, paralinguistic, and semantic information. Higher similarity between the reconstructed and original audio indicates stronger end-to-end speech modeling capability.

<details> <summary>Click to show mimick code.</summary>
import librosa

model = ...
model.init_tts(streaming=False)

system_prompt = "You are a helpful assistant. You can accept video, audio, and text input and output voice and text. Respond with just the answer, no redundancy."

mimick_prompt = "Please repeat the following speech in the appropriate language."

audio_input, _ = librosa.load("assets/Trump_WEF_2018_10s.mp3", sr=16000, mono=True)

msgs = [
    {"role": "system", "content": [system_prompt]},
    {"role": "user", "content": [mimick_prompt, audio_input]}
  ]

res = model.chat(
    msgs=msgs,
    do_sample=True,
    max_new_tokens=512,
    use_tts_template=True,
    temperature=0.1,
    generate_audio=True,
    output_audio_path="output_mimick.wav",
)
</details>

Addressing Various Audio Understanding Tasks <!-- omit in toc -->

MiniCPM-o-4.5 can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

  • ASR (Chinese, or AST EN→ZH): 请仔细听这段音频片段,并将其内容逐字记录。
  • ASR (English, or AST ZH→EN): Please listen to the audio snippet carefully and transcribe the content.
  • Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
  • General Audio Caption: Summarize the main content of the audio.
  • Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
<details> <summary>Click to show audio understanding code.</summary>
import librosa

model = ...
model.init_tts(streaming=False)

# Load the audio to be transcribed/analyzed
audio_input, _ = librosa.load("assets/Trump_WEF_2018_10s.mp3", sr=16000, mono=True)

# Choose a task prompt (see above for options)
task_prompt = "Please listen to the audio snippet carefully and transcribe the content.\n"
msgs = [{"role": "user", "content": [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    do_sample=True,
    max_new_tokens=512,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path="result_audio_understanding.wav",
)
print(res)
</details>

Visual Understanding <!-- omit in toc -->

MiniCPM-o-4.5 shares the same inference methods as MiniCPM-V-4.5.

Chat with Single Image <!-- omit in toc -->

<details> <summary>Click to show single image chat code.</summary>
import torch
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5",
    trust_remote_code=True,
    attn_implementation="sdpa",  # or "flash_attention_2"
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,
    init_tts=False,
)
model.eval().cuda()

image = Image.open("assets/fossil.png").convert("RGB")
question = "What is in the image?"
msgs = [{"role": "user", "content": [image, question]}]

enable_thinking = False  # Set True to enable thinking mode
stream = False  # Set True to return a string generator

res = model.chat(msgs=msgs, use_tts_template=False, enable_thinking=enable_thinking, stream=stream)
print(res)
</details>

Chat with Multiple Images <!-- omit in toc -->

<details> <summary>Click to show Python code for multi-image input.</summary>
import torch
from PIL import Image
from transformers import AutoModel

model = ...

image1 = Image.open("assets/highway.png").convert("RGB")
image2 = Image.open("assets/fossil.png").convert("RGB")
question = "Compare image 1 and image 2, tell me about the differences between them."
msgs = [{"role": "user", "content": [image1, image2, question]}]

answer = model.chat(msgs=msgs, use_tts_template=False, enable_thinking=False)
print(answer)
</details>

In-Context Few-Shot Learning <!-- omit in toc -->

<details> <summary>Click to show Python code for few-shot learning.</summary>
from PIL import Image

model = ...

question = "production date"
image1 = Image.open("example1.jpg").convert("RGB")
answer1 = "2023.08.04"
image2 = Image.open("example2.jpg").convert("RGB")
answer2 = "2007.04.24"
image_test = Image.open("test.jpg").convert("RGB")

msgs = [
    {"role": "user", "content": [image1, question]},
    {"role": "assistant", "content": [answer1]},
    {"role": "user", "content": [image2, question]},
    {"role": "assistant", "content": [answer2]},
    {"role": "user", "content": [image_test, question]},
]

answer = model.chat(msgs=msgs, use_tts_template=False, enable_thinking=False)
print(answer)
</details>

Chat with Video <!-- omit in toc -->

<details> <summary>Click to show Python code for video input.</summary>
import torch
from minicpmo.utils import get_video_frame_audio_segments
from transformers import AutoModel

model = ...

video_path = "assets/Skiing.mp4"
video_frames, _, _ = get_video_frame_audio_segments(video_path)
print("num frames:", len(video_frames))

question = "Describe the video"
msgs = [{"role": "user", "content": video_frames + [question]}]

answer = model.chat(
    msgs=msgs,
    max_new_tokens=128,
    use_image_id=False,
    max_slice_nums=1,
    use_tts_template=False,
    enable_thinking=False,  # Set True to enable thinking mode
)
print(answer)
</details>

Structured Content Input <!-- omit in toc -->

<details> <summary>Click to show structured content input details.</summary>

The chat method accepts message content in two formats:

Native format – pass Python objects directly:

msgs = [{"role": "user", "content": [pil_image, audio_ndarray, "Describe this."]}]

OpenAI-compatible format – use structured dictionaries:

msgs = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
            {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
            {"type": "video_url", "video_url": {"url": "/path/to/video.mp4", "use_audio": True}},
            {"type": "text", "text": "Describe this."}
        ]
    }
]

Supported types:

| Type | Input | Converts to |
|------|-------|-------------|
| text | {"type": "text", "text": "..."} | str |
| image_url | {"type": "image_url", "image_url": {"url": "..."}} | PIL.Image |
| audio_url | {"type": "audio_url", "audio_url": {"url": "..."}} | np.ndarray (16kHz mono) |
| video_url | {"type": "video_url", "video_url": {"url": "...", "stack_frames": 1, "use_audio": True}} | List[Image, ndarray, ...] |

  • URL sources: local file paths or http:// / https:// URLs
  • Mixed formats: native objects and structured dicts can be combined in the same content list
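A minimal sketch of mixing the two formats in one message (the file paths are placeholders and the model is assumed to be initialized as in the earlier examples):

from PIL import Image

# Mix a native PIL image with an OpenAI-style structured audio dict in one content list.
pil_image = Image.open("/path/to/image.jpg").convert("RGB")
msgs = [{
    "role": "user",
    "content": [
        pil_image,                                                            # native object
        {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},    # structured dict
        "Describe what you see and hear.",
    ],
}]
# res = model.chat(msgs=msgs, use_tts_template=False)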
</details>

FlagOS

<details> <summary>Click to show FlagOS Usage details.</summary>

Official website: https://flagos.io.

From Scratch <!-- omit in toc -->
  • Dependencies: Python 3.12, GLIBC 2.39, GLIBCXX 3.4.33, CXXABI 1.3.15
Transformers <!-- omit in toc -->
  • Installing the FlagOS Operator Library

    Official Repository: https://github.com/flagos-ai/FlagGems

    pip install flag-gems==4.2.1rc0
    
  • Installing the FlagOS Compiler

    Official Repository: https://github.com/flagos-ai/flagtree

    Quick Reference for Core Dependency Versions: https://github.com/flagos-ai/FlagTree/blob/main/documents/build.md#tips-for-building

    pip uninstall triton
    
    python3 -m pip install flagtree==0.4.0+3.5 --index-url=https://resource.flagos.net/repository/flagos-pypi-hosted/simple --trusted-host=https://resource.flagos.net
    
  • Activating Acceleration

    Add USE_FLAGOS=1 before the command for the task you want to run. For example, when you run:

    python3 generate_speech_from_video.py
    

    To use the MiniCPM-o-4.5 model to generate spoken responses from video content, you can:

    USE_FLAGOS=1 python3 generate_speech_from_video.py
    

    to accelerate this process with FlagOS.

vLLM Version <!-- omit in toc -->
  • Installing the FlagOS Operator Library

    Official Repository: https://github.com/flagos-ai/FlagGems

    pip install flag-gems==4.2.1rc0
    pip install triton==3.5.1
    
  • Activating Acceleration

    Add USE_FLAGOS=1 before the command for the task you want to run. For example, when you run:

    vllm serve ${model_path} --dtype auto  --gpu_memory_utilization 0.9 --trust-remote-code --max-num-batched-tokens 2048 --served-model-name cpmo --port ${Port}
    

    To start the MiniCPM-o-4.5 server, you can:

    USE_FLAGOS=1 vllm serve ${model_path} --dtype auto  --gpu_memory_utilization 0.9 --trust-remote-code --max-num-batched-tokens 2048 --served-model-name cpmo --port ${Port}
    

    to accelerate this process with FlagOS.

</details>

MiniCPM-V & o Cookbook

Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured cookbook, which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:

Easy Usage Documentation

Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.

Broad User Spectrum

We support a wide range of users, from individuals to enterprises and researchers.

  • Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup.
  • Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
  • Researchers: Leverage advanced frameworks including Transformers, LLaMA-Factory, SWIFT, and Align-anything to enable flexible model development and cutting-edge experimentation.

Versatile Deployment Scenarios

Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.

  • Web Demo: Full-duplex real-time video interaction solution with high responsiveness and low latency. WebRTC_Demo.
  • Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF and BNB.
  • End devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.

License

Model License

  • The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.

Statement

  • As an LMM, MiniCPM-o 4.5 generates content by learning from large amounts of multimodal corpora, but it cannot comprehend or express personal opinions or make value judgements. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers.
  • We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.

Key Techniques and Other Multimodal Projects <!-- omit in toc -->

👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:

VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation <!-- omit in toc -->

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}

Author: openbmb

Likes: 52

Downloads: 0

Tags: transformers, onnx, safetensors, minicpmo, feature-extraction, minicpm-o, minicpm-v, multimodal, full-duplex, any-to-any, custom_code, arxiv:2408.01800, license:apache-2.0, region:us

meituan-longcat/LongCat-Image-Edit-Turbo


license: apache-2.0
language:
  • en
  • zh
pipeline_tag: image-to-image
library_name: transformers

<div align="center"> <img src="assets/longcat-image_logo.svg" width="45%" alt="LongCat-Image" /> </div> <hr> <div align="center" style="line-height: 1;"> <a href='https://arxiv.org/pdf/2512.07584'><img src='https://img.shields.io/badge/Technical-Report-red'></a> <a href='https://github.com/meituan-longcat/LongCat-Image'><img src='https://img.shields.io/badge/GitHub-Code-black'></a> <a href='https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/figures/wechat_official_accounts.png'><img src='https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white'></a> <a href='https://x.com/Meituan_LongCat'><img src='https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white'></a> </div> <div align="center" style="line-height: 1;">

<a href='https://huggingface.co/meituan-longcat/LongCat-Image'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCat--Image-blue'></a> <a href='https://huggingface.co/meituan-longcat/LongCat-Image-Dev'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCat--Image--Dev-blue'></a> <a href='https://huggingface.co/meituan-longcat/LongCat-Image-Edit'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCat--Image--Edit-blue'></a> <a href='https://huggingface.co/meituan-longcat/LongCat-Image-Edit-Turbo'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCat--Image--Edit--Turbo-blue'></a>

</div>

Introduction

We introduce LongCat-Image-Edit-Turbo, the distilled version of LongCat-Image-Edit. It achieves high-quality image editing with only 8 NFEs (Number of Function Evaluations), offering extremely low inference latency.

<div align="center"> <img src="assets/model_struct_edit.png" width="90%" alt="LongCat-Image-Edit model" /> </div>

Installation

pip install git+https://github.com/huggingface/diffusers

Run Image Editing

[!CAUTION] 📝 Special Handling for Text Rendering

For both Text-to-Image and Image Editing tasks involving text generation, you must enclose the target text within single or double quotation marks (both English '...' / "..." and Chinese ‘...’ / “...” styles are supported).

Reasoning: The model utilizes a specialized character-level encoding strategy specifically for quoted content. Failure to use explicit quotation marks prevents this mechanism from triggering, which will severely compromise the text rendering capability.

import torch
from PIL import Image
from diffusers import LongCatImageEditPipeline

if __name__ == '__main__':
    device = torch.device('cuda')
    pipe = LongCatImageEditPipeline.from_pretrained("meituan-longcat/LongCat-Image-Edit-Turbo", torch_dtype=torch.bfloat16)
    # pipe.to(device, torch.bfloat16)  # Uncomment for high VRAM devices (Faster inference)
    pipe.enable_model_cpu_offload()  # Offload to CPU to save VRAM (Required ~18 GB); slower but prevents OOM
    img = Image.open('assets/test.png').convert('RGB')
    prompt = '将猫变成狗'  # "Turn the cat into a dog"
    image = pipe(
        img,
        prompt,
        negative_prompt='',
        guidance_scale=1,
        num_inference_steps=8,
        num_images_per_prompt=1,
        generator=torch.Generator("cpu").manual_seed(43)
    ).images[0]
    image.save('./edit_example.png')
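As a quick illustration of the quotation-mark rule above (the wording here is a made-up example, not from the official documentation), a text-rendering edit prompt can be written as:

# The target text stays inside quotation marks so the character-level encoding is triggered.
prompt = '将招牌上的文字替换为“欢迎光临”'  # Replace the sign text with “欢迎光临” (“Welcome”)
# The same rule applied to an English prompt:
prompt_en = 'Change the text on the sign to "Grand Opening"'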

Author: meituan-longcat

Likes: 25

Downloads: 0

Tags: transformers, diffusers, safetensors, image-to-image, en, zh, arxiv:2512.07584, license:apache-2.0, endpoints_compatible, region:us

Comfy-Org/ace_step_1.5_ComfyUI_files

These are repackaged files for use with ComfyUI; the original model repo is: https://huggingface.co/ACE-Step/Ace-Step1.5

Author: Comfy-Org

Likes: 24

Downloads: 0

Tags: region:us

Yuliang101/ConceptLM_Llama3.1_8B

ConceptLM-Llama3.1-8B, continually pre-trained from Llama-3.1-8B on 9.6B tokens.

| Benchmark | Setting | Metric | Result |
|-----------------|---------|---------|--------|
| MMLU | 5-shot | ACC | 64.62 |
| AGIEval | 3-shot | ACC | 35.11 |
| ARC-Challenge | 25-shot | ACC | 54.18 |
| SQuAD 2.0 | 5-shot | F1 / EM | 36.51 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|------:|------|-----:|------------|---|------:|---|------|
| squadv2 | 3 | none | 1 | HasAns_exact | ↑ | 63.8495 | ± | N/A |
| | | none | 1 | HasAns_f1 | ↑ | 73.1161 | ± | N/A |
| | | none | 1 | NoAns_exact | ↑ | 0.0000 | ± | N/A |
| | | none | 1 | NoAns_f1 | ↑ | 0.0000 | ± | N/A |
| | | none | 1 | best_exact | ↑ | 50.6022 | ± | N/A |
| | | none | 1 | best_f1 | ↑ | 52.9611 | ± | N/A |
| | | none | 1 | exact | ↑ | 31.8791 | ± | N/A |
| | | none | 1 | f1 | ↑ | 36.5057 | ± | N/A |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|------------------|------:|------|------|------|---|-----:|---|-----:|
| mmlu | 2 | none | | acc | ↑ | 0.6462 | ± | 0.0038 |
| - humanities | 2 | none | | acc | ↑ | 0.5955 | ± | 0.0068 |
| - other | 2 | none | | acc | ↑ | 0.7119 | ± | 0.0078 |
| - social sciences | 2 | none | | acc | ↑ | 0.7527 | ± | 0.0076 |
| - stem | 2 | none | | acc | ↑ | 0.5531 | ± | 0.0085 |

You can reproduce our results using our released code.

Author: Yuliang101

Likes: 6

Downloads: 0

Tags: safetensors, llama, region:us

Yuliang101/ConceptLM_Pythia_410M

ConceptLM-Pythia-410M, pre-trained from scratch with 300B tokens from the Pile.

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|----------------|------:|------|-----:|---------------|---|------:|---|------|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2210 | ± | 0.0121 |
| | | none | 0 | acc_norm | ↑ | 0.2645 | ± | 0.0129 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.5202 | ± | 0.0103 |
| | | none | 0 | acc_norm | ↑ | 0.4689 | ± | 0.0102 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.3434 | ± | 0.0047 |
| | | none | 0 | acc_norm | ↑ | 0.4101 | ± | 0.0049 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.5304 | ± | 0.0070 |
| | | none | 0 | perplexity | ↓ | 9.8351 | ± | 0.2845 |
| lambada_standard | 1 | none | 0 | acc | ↑ | 0.3971 | ± | 0.0068 |
| | | none | 0 | perplexity | ↓ | 23.1996 | ± | 0.8277 |
| piqa | 1 | none | 0 | acc | ↑ | 0.6659 | ± | 0.0110 |
| | | none | 0 | acc_norm | ↑ | 0.6687 | ± | 0.0110 |
| race | 2 | none | 0 | acc | ↑ | 0.3024 | ± | 0.0142 |
| sciq | 1 | none | 0 | acc | ↑ | 0.8220 | ± | 0.0121 |
| | | none | 0 | acc_norm | ↑ | 0.7650 | ± | 0.0134 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.8012 | ± | N/A |
| | | none | 0 | byte_perplexity | ↓ | 1.7425 | ± | N/A |
| | | none | 0 | word_perplexity | ↓ | 19.4855 | ± | N/A |
| winogrande | 1 | none | 0 | acc | ↑ | 0.5414 | ± | 0.0140 |

Author: Yuliang101

Likes: 5

Downloads: 0

Tags: pytorch, gpt_neox, region:us

allenai/SERA-14B


license: apache-2.0

SERA-14B

SERA-14B is the fifth model in Ai2's Open Coding Agents series. It is a state-of-the-art 14B open-source coding agent that achieves 41.7% on SWE-bench Verified, outperforming or matching larger models like DeepSWE-Preview and SkyRL-Agent.

Model Variants

| Model | HuggingFace | Base | Teacher | SWE-bench Verified |
|-------|-------------|------|---------|--------------------|
| SERA-32B | allenai/SERA-32B | Qwen 3-32B | GLM-4.6 | 49.5% ± 1.9% |
| SERA-32B-GA | allenai/SERA-32B-GA | Qwen 3-32B | GLM-4.5-Air | 46.6% ± 0.7% |
| SERA-14B | allenai/SERA-14B | Qwen 3-14B | GLM-4.6 | 41.7% ± 0.5% |
| SERA-8B | allenai/SERA-8B | Qwen 3-8B | GLM-4.6 | 31.7% ± 0.9% |
| SERA-8B-GA | allenai/SERA-8B-GA | Qwen 3-8B | GLM-4.5-Air | 31.7% ± 0.4% |

All results evaluated at 32K context length. Standard deviations computed over 3 random seeds.

Performance

SWE-bench Verified (32K Context)

| Model | Type | Resolve Rate |
|-------|------|--------------|
| SkyRL-8B | Open-source | 9.4% |
| Nex-N1-8B | Open-source | 20.3% |
| SERA-8B | Open-source | 31.7% |
| Qwen 3-32B (base) | Open-weight | 24.4% |
| SWE-smith | Open-source | 32.6% |
| SkyRL-Agent | Open-source | 39.4% |
| SERA-14B | Open-source | 41.7% |
| DeepSWE | Open-source | 42.2% |
| SERA-32B | Open-source | 49.5% |
| Devstral-Small-2 (24B) | Open-weight | 50.0% |
| GLM-4.5-Air (110B) | Open-weight | 50.5% |

Open-source: code, model weights, and data publicly available. Open-weight: model weights available but training data/code not fully released.

Quickstart

The easiest way to use SERA is with the sera CLI, which provides seamless integration with Claude Code:

# Install the CLI
uv tool install ai2-sera-cli

# Option 1: Deploy on Modal (recommended for trying out)
modal setup  # one-time setup
sera --modal

# Option 2: Use an existing endpoint
export SERA_API_KEY=<your_api_key>
sera --endpoint <endpoint_url>

The first run with --modal takes approximately 10 minutes to download the model (~65GB) and compile. Subsequent runs start in 1-2 minutes.

For more deployment options, see the sera-cli documentation.

Self Hosting

vllm serve allenai/SERA-14B --port 8001 \
            --tensor-parallel-size 4 \
            --max-model-len 32768 \
            --trust-remote-code \
            --enable-auto-tool-choice \
            --tool-call-parser hermes \
            --enforce-eager \
            --seed 42 \
            --disable-cascade-attn
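Once the server is running, it exposes an OpenAI-compatible API. Below is a minimal sketch of querying it from Python; the endpoint and model name follow the command above, while the prompt and the use of the openai client package are illustrative assumptions.

from openai import OpenAI

# Query the OpenAI-compatible endpoint started by `vllm serve` above.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="allenai/SERA-14B",
    messages=[{"role": "user", "content": "Explain what the failing test in tests/test_pagination.py is checking."}],
    max_tokens=512,
)
print(response.choices[0].message.content)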

Model Details

| | |
|---|---|
| Developer | Allen Institute for AI (Ai2) |
| Authors | Ethan Shen, Daniel Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers |
| Base Model | Qwen 3-14B |
| Teacher Model | GLM-4.6 (357B) |
| Model Type | Coding agent / Software engineering agent |
| Training Method | Supervised fine-tuning on synthetic agent trajectories |
| Context Length | 32K tokens |
| License | Apache 2.0 |

Training Configuration

| | |
|---|---|
| Epochs | 3 |
| Learning Rate | 1e-5 |
| Weight Decay | 0.01 |
| Max Sequence Length | 32,768 tokens |
| Training Framework | Axolotl |
| Inference Framework | vLLM |

Training Data

SERA-14B is trained on 25,000 synthetic coding agent trajectories generated using Soft Verified Generation (SVG). SVG is a two-rollout pipeline:

  1. First rollout: A teacher model makes a change to a codebase starting from a randomly selected function
  2. Synthetic PR: The trajectory is converted into a pull request description
  3. Second rollout: The teacher attempts to reproduce the change given only the PR description
  4. Soft verification: Patches are compared using line-level recall (no test execution required)

This approach removes the need for test infrastructure and enables data generation from any repository.

  • Source Repositories: 121 Python codebases
  • Teacher Model: GLM-4.6 (357B)
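To make the soft-verification step concrete, here is a hedged sketch of a line-level recall check between the two rollouts' patches; the exact metric used in SVG may differ, and the names below are purely illustrative.

def line_recall(reference_patch: str, candidate_patch: str) -> float:
    # Fraction of the reference patch's changed diff lines that the candidate
    # patch reproduces; no test execution is required.
    def changed_lines(patch: str) -> set:
        # Keep added/removed lines of a unified diff, ignoring the file headers.
        return {
            line for line in patch.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        }

    ref = changed_lines(reference_patch)
    cand = changed_lines(candidate_patch)
    return len(ref & cand) / len(ref) if ref else 0.0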

Intended Use

  • Automated software engineering: Bug fixes, feature implementation, refactoring
  • Repository specialization: Fine-tune on private codebases to create specialized coding agents (~8,000 trajectories / $1,300)
  • Research: Studying coding agents, data generation methods, and agent behavior

Limitations

  • SWE-bench training artifact: The model was trained on SWE-bench-style tasks and may attempt to call a nonexistent submit tool when finished editing. The sera-cli proxy handles this automatically.
  • Evaluation scope: Only validated on SWE-bench Verified (Python repositories). Performance on other languages or benchmarks is unknown.
  • Teacher bound: Performance is largely bounded by the teacher model (GLM-4.6) capability.
  • Statistical variance: Results computed over 3 seeds. Effects smaller than 2-3% should be interpreted with caution.
  • Model-specific: Experiments use Qwen 3 as the base model. Generalization to other model families is not validated.

Bias, Risks, and Limitations

Like any language model without safety filtering, SERA can be prompted to generate harmful or insecure code. Users should be aware of the following risks:

  • Code security: May generate code with security vulnerabilities (e.g., injection attacks, insecure defaults). All generated code should be reviewed before deployment.
  • Accuracy: May produce incorrect or buggy code. Outputs should be tested and verified.
  • Inherited biases: May reflect biases present in the Qwen 3-14B base model and GLM-4.6 teacher model.
  • Misuse potential: Could potentially be used to generate malicious code or identify vulnerabilities for exploitation.

Responsible Use

This model is intended for research and educational use. Users should adhere to Ai2's Responsible Use Guidelines. Key principles include:

  • Use the model for beneficial purposes
  • Review and test all generated code before deployment
  • Do not use to generate malicious software or exploit vulnerabilities
  • Consider the potential impact of automated code generation in your context

Hardware Requirements

| Configuration | GPU | Notes |
|--------------|-----|-------|
| Minimum | 1× 80GB GPU (A100, H100) | 32K context |
| Recommended | 1× H100 | Best performance |

Quantization (AWQ, GPTQ) can reduce memory requirements if needed.

License

This model is licensed under Apache 2.0. It is intended for research and educational use and may be used commercially in accordance with Ai2's Responsible Use Guidelines.

Citation

@misc{shen2026serasoftverifiedefficientrepository,
      title={SERA: Soft-Verified Efficient Repository Agents}, 
      author={Ethan Shen and Danny Tormoen and Saurabh Shah and Ali Farhadi and Tim Dettmers},
      year={2026},
      eprint={2601.20789},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.20789}, 
}

Contact

  • Email: ethans03@cs.washington.edu, dettmers@cmu.edu
  • Issues: GitHub Issues

SERA / Open Coding Agents - Disclaimer Text

Bias, Risks, and Limitations

SERA-32B/SERA-14B/SERA-8B is an open coding agent model released for research and educational purposes without any safety filtering or safety tuning. As a research artifact, this model is not suitable for real-world use without significant human oversight. Like other coding agents, this model may propagate biases present in training data or generate incorrect or insecure code. Security risks include prompt injection and data leakage. Always verify code outputs and manage context windows to avoid disclosing sensitive data or information.

Like any base language model or fine-tuned model without safety filtering, these models can easily be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. Additionally, many statements from OLMo or any LLM are often inaccurate, so facts should be verified.

License

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

Author: allenai

Likes: 5

Downloads: 0

Tags: safetensors, qwen3, arxiv:2601.20789, license:apache-2.0, region:us

unsloth/GLM-OCR


license: mit
language:
  • zh
  • en
  • fr
  • es
  • ru
  • de
  • ja
  • ko
pipeline_tag: image-to-text
library_name: transformers
base_model:
  • zai-org/GLM-OCR

GLM-OCR

<div align="center"> <img src=https://raw.githubusercontent.com/zai-org/GLM-OCR/refs/heads/main/resources/logo.svg width="40%"/> </div> <p align="center"> 👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-OCR/refs/heads/main/resources/wechat.jpg" target="_blank">WeChat</a> and <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community <br> 📍 Use GLM-OCR's <a href="https://docs.z.ai/guides/vlm/glm-ocr" target="_blank">API</a> </p>

Introduction

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

Key Features

  • State-of-the-Art Performance: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.

  • Optimized for Real-World Scenarios: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.

  • Efficient Inference: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.

  • Easy to Use: Fully open-sourced and equipped with a comprehensive SDK and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.

Performance

  • Document Parsing & Information Extraction


  • Real-World Scenarios Performance


  • Speed Test

For speed, we compared different OCR methods under identical hardware and testing conditions (single replica, single concurrency), evaluating their performance in parsing and exporting Markdown files from both image and PDF inputs. Results show GLM-OCR achieves a throughput of 1.86 pages/second for PDF documents and 0.67 images/second for images, significantly outperforming comparable models.


Usage

vLLM

  1. run
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

or using docker with:

docker pull vllm/vllm-openai:nightly
  2. run with:
pip install git+https://github.com/huggingface/transformers.git
vllm serve zai-org/GLM-OCR  --allowed-local-media-path /  --port 8080

SGLang

  1. using docker with:
docker pull lmsysorg/sglang:dev

or build it from source with:

pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
  2. run with:
pip install git+https://github.com/huggingface/transformers.git
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080

Ollama

  1. Download Ollama.
  2. run with:
ollama run glm-ocr

Ollama will automatically use the image's file path when an image is dragged into the terminal:

ollama run glm-ocr Text Recognition: ./image.png

Transformers

pip install git+https://github.com/huggingface/transformers.git
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

MODEL_PATH = "zai-org/GLM-OCR"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "test_image.png"
            },
            {
                "type": "text",
                "text": "Text Recognition:"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)

Supported Prompts

GLM-OCR currently supports two types of prompt scenarios:

  1. Document Parsing – extract raw content from documents. Supported tasks include:
{
    "text": "Text Recognition:",
    "formula": "Formula Recognition:",
    "table": "Table Recognition:"
}
  2. Information Extraction – extract structured information from documents. Prompts must follow a strict JSON schema. For example, to extract personal ID information:
请按下列JSON格式输出图中信息:
{
    "id_number": "",
    "last_name": "",
    "first_name": "",
    "date_of_birth": "",
    "address": {
        "street": "",
        "city": "",
        "state": "",
        "zip_code": ""
    },
    "dates": {
        "issue_date": "",
        "expiration_date": ""
    },
    "sex": ""
}

⚠️ Note: When using information extraction, the output must strictly adhere to the defined JSON schema to ensure downstream processing compatibility.
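For downstream processing, a minimal sketch of validating an extraction result against the expected top-level keys is shown below; it is illustrative only and not part of the official SDK, and raw_output stands for the model's JSON reply with any special tokens already stripped.

import json

# Hypothetical model reply for the ID-card schema above.
raw_output = '{"id_number": "", "last_name": "", "first_name": "", "date_of_birth": "", "address": {}, "dates": {}, "sex": ""}'
expected_keys = {"id_number", "last_name", "first_name", "date_of_birth", "address", "dates", "sex"}

record = json.loads(raw_output)
missing = expected_keys.difference(record)
if missing:
    raise ValueError(f"Extraction output is missing fields: {missing}")
print("Extraction output matches the expected schema.")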

GLM-OCR SDK

We provide an easy-to-use SDK for working with GLM-OCR more efficiently and conveniently. Please check our GitHub for more details.

Acknowledgement

This project is inspired by the excellent work of the following projects and communities:

License

The GLM-OCR model is released under the MIT License.

The complete OCR pipeline integrates PP-DocLayoutV3 for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.

Author: unsloth

Likes: 4

Downloads: 0

Tags: transformers, safetensors, glm_ocr, image-to-text, zh, en, fr, es, ru, de, ja, ko, base_model:zai-org/GLM-OCR, base_model:finetune:zai-org/GLM-OCR, license:mit, endpoints_compatible, region:us

ServiceNow-AI/apriel-multi-env-1

Author: ServiceNow-AI

Likes: 3

Downloads: 0

Tags: safetensors, mistral, region:us

unsloth/Qwen3-Coder-Next-FP8-Dynamic


tags:

  • qwen3_next
  • unsloth
  • qwen
  • qwen3

base_model: Qwen/Qwen3-Coder-Next
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation

To run Qwen3-Coder-Next locally, read our guide: https://unsloth.ai/docs/models/qwen3-coder-next. Unsloth Dynamic 2.0 (https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) achieves superior accuracy and outperforms other leading quants.

Qwen3-Coder-Next

Highlights

Today, we're announcing Qwen3-Coder-Next, an open-weight language model designed specifically for coding agents and local development. It features the following key enhancements:

  • Super Efficient with Significant Performance: With only 3B activated parameters (80B total parameters), it achieves performance comparable to models with 10–20x more active parameters, making it highly cost-effective for agent deployment.
  • Advanced Agentic Capabilities: Through an elaborate training recipe, it excels at long-horizon reasoning, complex tool usage, and recovery from execution failures, ensuring robust performance in dynamic coding tasks.
  • Versatile Integration with Real-World IDE: Its 256k context length, combined with adaptability to various scaffold templates, enables seamless integration with different CLI/IDE platforms (e.g., Claude Code, Qwen Code, Qoder, Kilo, Trae, Cline, etc.), supporting diverse development environments.


Model Overview

Qwen3-Coder-Next has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 80B in total and 3B activated
  • Number of Parameters (Non-Embedding): 79B
  • Hidden Dimension: 2048
  • Number of Layers: 48
    • Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
  • Gated Attention:
    • Number of Attention Heads: 16 for Q and 2 for KV
    • Head Dimension: 256
    • Rotary Position Embedding Dimension: 64
  • Gated DeltaNet:
    • Number of Linear Attention Heads: 32 for V and 16 for QK
    • Head Dimension: 128
  • Mixture of Experts:
    • Number of Experts: 512
    • Number of Activated Experts: 10
    • Number of Shared Experts: 1
    • Expert Intermediate Dimension: 512
  • Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Meanwhile, specifying enable_thinking=False is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Quickstart

We advise you to use the latest version of transformers.

The following code snippet illustrates how to use the model to generate content based on given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-Next"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32,768.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Deployment

For deployment, you can use the latest sglang or vllm to create an OpenAI-compatible API endpoint.

SGLang

SGLang is a fast serving framework for large language models and vision language models. It can be used to launch a server with an OpenAI-compatible API.

sglang>=v0.5.8 is required for Qwen3-Coder-Next, which can be installed using:

pip install 'sglang[all]>=v0.5.8'

See its documentation for more details.

The following command can be used to create an API endpoint at http://localhost:30000/v1 with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs.

python -m sglang.launch_server --model Qwen/Qwen3-Coder-Next --port 30000 --tp-size 2 --tool-call-parser qwen3_coder

[!Note] The default context length is 256K. Consider reducing the context length to a smaller value, e.g., 32768, if the server fails to start.
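
Once the server is up, it can be exercised with any OpenAI-compatible client. The snippet below is a minimal sketch rather than an official example; it assumes the server from the command above is on localhost:30000 and serves the model under its Hugging Face path, and it streams tokens as they are generated.

# Minimal sketch: stream a completion from the SGLang endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quick sort algorithm."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g., a final usage chunk) may carry no content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()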

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It can be used to launch a server with an OpenAI-compatible API.

vllm>=0.15.0 is required for Qwen3-Coder-Next, which can be installed using:

pip install 'vllm>=0.15.0'

See its documentation for more details.

The following command can be used to create an API endpoint at http://localhost:8000/v1 with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs.

vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder

[!Note] The default context length is 256K. Consider reducing the context length to a smaller value, e.g., 32768, if the server fails to start.

Agentic Coding

Qwen3-Coder-Next excels in tool calling capabilities.

You can simply define or use any tools, as in the following example.

# Your tool implementation
def square_the_number(num: float) -> float:
    return num ** 2

# Define Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared"
                    }
                },
            }
        }
    }
]

from openai import OpenAI
# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url='http://localhost:8000/v1',  # api_base
    api_key="EMPTY"
)
 
messages = [{'role': 'user', 'content': 'square the number 1024'}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-Next",
    max_tokens=65536,
    tools=tools,
)

print(completion.choices[0])
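
The completion above typically contains a tool call rather than a final answer. The following is a minimal sketch (not from the official example) of closing the loop in the standard OpenAI tool-calling format: execute square_the_number locally, append the result as a tool message, and request the final reply.

import json

# Continue the example above: execute the requested tool call and return its result.
tool_call = completion.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool
args = json.loads(tool_call.function.arguments)          # e.g. {"input_num": 1024}
result = square_the_number(args["input_num"])

messages.append(completion.choices[0].message)  # the assistant turn containing the tool call
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": str(result),
})

final = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-Next",
    max_tokens=65536,
    tools=tools,
)
print(final.choices[0].message.content)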

Best Practices

To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.
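
As a sketch of applying these values (assumptions, not official snippets): with an OpenAI-compatible endpoint such as the vLLM server above, temperature and top_p are standard request fields, while top_k is usually passed through extra_body; with Transformers, the same values go directly to generate().

from openai import OpenAI

# Recommended sampling parameters via the OpenAI-compatible API (vLLM endpoint from above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quick sort algorithm."}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # top_k is not a standard OpenAI field; vLLM/SGLang typically read it from extra_body
)
print(response.choices[0].message.content)

# With Transformers, pass the same values to generate(), reusing `model` and `model_inputs` from the Quickstart:
# generated_ids = model.generate(**model_inputs, max_new_tokens=65536,
#                                do_sample=True, temperature=1.0, top_p=0.95, top_k=40)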

Citation

If you find our work helpful, feel free to cite us.

@techreport{qwen_qwen3_coder_next_tech_report,
  title        = {Qwen3-Coder-Next Technical Report},
  author       = {{Qwen Team}},
  url          = {https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf},
  note         = {Accessed: 2026-02-03}
}

Author: unsloth

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_next, text-generation, unsloth, qwen, qwen3, conversational, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, compressed-tensors, region:us