Today's AI Summary

AI Developments: Cartoon Generation, Privacy Risks, and New Models Emerge

Here's a look at the latest AI advancements, covering research papers and newly released models.

Research Highlights

Several interesting research papers have been published, spanning diverse areas of AI:

  • Cartoon Production: The paper "ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing" introduces a new generative model that unifies inbetweening and colorization, reducing manual effort in cartoon and anime production. ToonComposer uses a sparse sketch injection mechanism for precise control and adapts a video foundation model to the cartoon domain. The paper also introduces PKBench, a benchmark featuring human-drawn sketches for evaluation.
  • Privacy Risks in LLM Agents: "Searching for Privacy Risks in LLM Agents via Simulation" presents a framework to identify vulnerabilities in LLM-based agents. The framework simulates interactions between agents to discover attack strategies that extract sensitive information. The research highlights the evolution of attack tactics and defenses in these simulated environments.
  • Diffusion Language Models: The survey paper "A Survey on Diffusion Language Models" provides a comprehensive overview of Diffusion Language Models (DLMs), a promising alternative to autoregressive models. The paper covers foundational principles, state-of-the-art models, inference strategies, and multimodal extensions of DLMs.
  • Orbital Path Planning: "TLE-Based A2C Agent for Terrestrial Coverage Orbital Path Planning" introduces a reinforcement learning framework using the Advantage Actor-Critic (A2C) algorithm to optimize satellite orbital parameters for terrestrial coverage. The A2C agent outperforms Proximal Policy Optimization (PPO) in achieving mission objectives while maintaining computational efficiency.
  • Medical Visual Question Answering: The Medico 2025 challenge, detailed in "Medico 2025: Visual Question Answering for Gastrointestinal Imaging," focuses on developing Explainable AI models for answering clinical questions based on GI endoscopy images. The challenge uses the Kvasir-VQA-x1 dataset and aims to advance trustworthy AI in medical image analysis.
  • Brain Tumor MRI Reasoning: "Performance of GPT-5 in Brain Tumor MRI Reasoning" evaluates the performance of GPT-5 family models on a brain tumor VQA benchmark. The results indicate moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
  • Automated Interpreting Assessment: "From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms" proposes a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning for automated interpreting quality assessment. The approach prioritizes explainability over "black box" predictions.
  • Reinforced Language Models: "Reinforced Language Models for Sequential Decision Making" introduces Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents for sequential decision-making. The algorithm is effective in improving decision-making performance, outperforming a 72B parameter baseline by 50% on the Frozen Lake task.
  • Subjective Self-Disclosure: "A Multimodal Neural Network for Recognizing Subjective Self-Disclosure Towards Social Robots" develops a custom multimodal attention network for recognizing subjective self-disclosure in human-robot interactions. The model achieves an F1 score of 0.83, an improvement of 0.48 over the best baseline model.
  • Echo State Networks: "Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains" examines Echo State Network performance on four benchmark problems and proposes rules of thumb for configuring the architecture and selecting parameter values, applicable to problems within the same domain.
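The Echo State Network setup studied in the last item can be illustrated in a few lines. This is a minimal NumPy sketch, not the paper's configuration: the reservoir size, spectral radius, and leak rate below are illustrative rule-of-thumb values of the kind the paper proposes heuristics for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative configuration values (not taken from the paper).
n_inputs, n_reservoir = 1, 100
spectral_radius, leak = 0.9, 0.3   # < 1 is a common echo-state rule of thumb

# Fixed random input and reservoir weights; rescale W so its largest
# absolute eigenvalue equals the target spectral radius.
W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs))
W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(u):
    """Collect leaky-integrator reservoir states for inputs u of shape (T, n_inputs)."""
    x, states = np.zeros(n_reservoir), []
    for u_t in u:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u_t + W @ x)
        states.append(x)
    return np.array(states)

# Only the linear readout is trained (ridge regression), here on
# next-step prediction of a sine wave.
u = np.sin(np.linspace(0, 8 * np.pi, 400))[:, None]
X, y = run_reservoir(u[:-1]), u[1:, 0]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ y)
pred = X @ W_out   # one-step-ahead predictions
```

Only `W_out` is learned; the random reservoir stays fixed, which is what makes the spectral radius and leak rate the key configuration knobs.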

Model Releases

Several new models have been released, focusing on various applications:

  • Huihui-gpt-oss-120b-BF16-abliterated: This model, released by huihui-ai, is a large language model based on the GPT-OSS architecture. It is designed for conversational AI and text generation.
  • Wan22-FastMix: Zuntan has released Wan22-FastMix, an image-to-video model.
  • yolo11x_watermark_detection: corzent has released a fine-tuned YOLO11x model for watermark and logo detection.
  • Qwen2.5-VL-3B-Abliterated-Caption-it: prithivMLmods has released a fine-tuned version of Qwen2.5-VL-3B-Instruct, designed for uncensored image captioning. It generates detailed descriptions across a broad range of visual categories.

AI Papers for 2026-04-14

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm-generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights underlying harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm-generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Prompt learning is a parameter-efficient approach for adapting vision-language models, yet its robustness under label noise is underexplored. Visual content carries richer and more reliable semantic information, which remains comparatively robust under label noise, whereas the prompt itself is highly susceptible to it. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to inject visual semantics back into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. Because injecting visual information the same way for all samples is unstable when the quality of their visual cues differs, we further introduce a lightweight conditional modulation mechanism that adaptively controls the strength of visual injection, striking a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
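The cross-modal injection idea in the VisPrompt abstract can be sketched with plain attention arithmetic. Everything below is a toy illustration, not the authors' implementation: the dimensions, the `softmax` helper, and the scalar `gate` standing in for the conditional modulation mechanism are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                              # shared feature dimension (illustrative)
prompt = rng.normal(size=(4, d))    # 4 learnable prompt tokens
visual = rng.normal(size=(9, d))    # 9 image-patch features from the vision encoder

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-modal attention: prompt tokens act as queries over the visual
# patches, so each token aggregates the patch evidence relevant to
# this particular sample.
attn = softmax(prompt @ visual.T / np.sqrt(d))  # (4, 9) attention weights
injected = attn @ visual                        # (4, d) aggregated visual semantics

# A scalar gate stands in for the paper's conditional modulation: it
# would be predicted per sample from the quality of its visual cues.
gate = 0.5
prompt_out = prompt + gate * injected
```

Anchoring the prompt tokens to `injected` (instance-level visual evidence) rather than to the possibly noisy label is what the abstract argues makes training more robust.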

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
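The two intrinsic signals VL-Calibration combines, KL divergence under image perturbation and token entropy, are easy to compute on toy distributions. The numbers and function names below are illustrative, not taken from the paper.

```python
import numpy as np

def token_entropy(p):
    """Shannon entropy (nats) of a next-token distribution."""
    p = np.asarray(p, dtype=float)
    logs = np.log(p, where=p > 0, out=np.zeros_like(p))
    return float(-np.sum(p * logs))

def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy next-token distributions for the same answer token under the
# clean image and a perturbed image (hypothetical numbers).
p_clean = np.array([0.7, 0.2, 0.1])
p_perturbed = np.array([0.4, 0.4, 0.2])

# Large divergence => the prediction actually depends on the image
# (visual grounding); low entropy => the model is internally certain.
grounding_gap = kl_divergence(p_clean, p_perturbed)
internal_certainty = token_entropy(p_clean)
```

A token whose distribution barely moves under perturbation is likely driven by language priors, which is exactly the failure mode the abstract says a single holistic confidence score conflates.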

Envisioning the Future, One Step at a Time

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.

Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP $Q_{m,T}(M)$ - the unique coarsest abstraction consistent with an agent's capacity - serves as a capacity-derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate $R_{\text{crit}}$ determined by the quotient mismatch, intent-preserving communication is structurally impossible. In the supported one-way memoryless regime, classical side-information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed-$\varepsilon$ structural phase-transition theorem whose lower bound is fully general on the common-history quotient comparison; (2) a one-way Wyner-Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long-run bridge via explicit mixing bounds; (3) an asymptotic one-way converse in the shrinking-distortion regime $\varepsilon = O(1/T)$, proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured-policy benchmark shows the one-way rate can drop by up to $19\times$ relative to the counting bound, and a shrinking-distortion sweep matches the regime of the asymptotic converse.

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality. (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift: these mechanisms anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.

Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games

AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.

AI Models

Rta-AILabs/Nandi-Mini-150M-Instruct


license: apache-2.0
language: en, hi, mr, ta, te, kn, ml, bn, pa, gu, or
pipeline_tag: text-generation
library_name: transformers
base_model: Rta-AILabs/Nandi-Mini-150M

Nandi-Mini-150M-Instruct

Introduction

Nandi-Mini-150M-Instruct is a compact, efficient multilingual language model designed for strong performance in resource-constrained environments. It is pre-trained from scratch on 525 billion tokens and further enhanced through instruction tuning and Direct Preference Optimization (DPO). The model supports English and 10 Indic languages.

Nandi-Mini-150M-Instruct maximizes performance per parameter through architectural efficiency rather than scale. It is optimized for edge devices, on-prem deployments, and low-latency applications. The model brings the following key features:

  • Strong multilingual capability across English and Indic languages
  • Efficient design enabling high performance at small scale (150M parameters)
  • Reduced memory footprint using factorized embeddings
  • Better parameter efficiency through layer sharing
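The "factorized embeddings" bullet can be quantified with back-of-the-envelope arithmetic. The sizes below are hypothetical (the card does not publish the actual dimensions); the point is only how a low-rank factorization shrinks the embedding table:

```python
# Hypothetical sizes; the model card does not publish the real dimensions.
vocab_size, hidden_dim, bottleneck = 50_000, 640, 128

# Standard embedding: one vocab_size x hidden_dim table.
standard = vocab_size * hidden_dim

# Factorized embedding: a vocab_size x bottleneck table plus a
# bottleneck x hidden_dim projection (ALBERT-style factorization).
factorized = vocab_size * bottleneck + bottleneck * hidden_dim

print(standard)    # 32000000
print(factorized)  # 6481920, about 4.9x fewer embedding parameters
```

For a 150M-parameter model the embedding table is a large fraction of the total, so this kind of factorization is a natural lever for the "reduced memory footprint" the card claims.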

πŸ“ Upcoming Releases & Roadmap

We're just getting started with the Nandi series!

  • Nandi-Mini-150M-Tool-Calling (specialized model) - coming soon this week
  • Nandi-Mini-500M (Base + Instruct) - pre-training in progress
  • Nandi-Mini-1B (Base + Instruct) - pre-training in progress

Blogs & technical deep-dives are coming soon, where we'll share:

  • Architecture decisions and design trade-offs
  • Training insights and dataset composition
  • Benchmarks and real-world applications

Stay tuned!

🌍 Supported Languages

The model is trained on English and a diverse set of Indic languages, including:

  • Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia

Usage

!pip install transformers==5.4.0

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Rta-AILabs/Nandi-Mini-150M-Instruct"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16
).to(device).eval()

prompt = "Explain Newton's second law of motion"

messages = [
    {"role": "user", "content": prompt}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.3,
    top_p=0.90,
    top_k=20,
    repetition_penalty=1.1,
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

Feedback & Suggestions

We'd love to hear your thoughts, feedback, and ideas!

  • Email: support@rtaailabs.com
  • Official Website: https://rtaailabs.com/
  • LinkedIn: https://www.linkedin.com/company/rta-ai-lab
  • X (Twitter): https://x.com/Rta_AILabs

Author: Rta-AILabs

Likes: 23

Downloads: 0

Tags: transformers, safetensors, nandi, text-generation, conversational, custom_code, en, hi, mr, ta, te, kn, ml, bn, pa, gu, or, base_model:Rta-AILabs/Nandi-Mini-150M, base_model:finetune:Rta-AILabs/Nandi-Mini-150M, license:apache-2.0, region:us

cyankiwi/MiniMax-M2.7-AWQ-4bit


pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
library_name: transformers
base_model: MiniMaxAI/MiniMax-M2.7

Community: WeChat (https://platform.minimaxi.com/docs/faq/contact-us) | Discord (https://discord.com/invite/DPC4AHFCBw)

Links: MiniMax Agent (https://agent.minimax.io/) | API (https://platform.minimax.io/docs/guides/text-generation) | MCP (https://github.com/MiniMax-AI/MiniMax-MCP) | MiniMax Website (https://www.minimax.io) | Hugging Face (https://huggingface.co/MiniMaxAI) | GitHub (https://github.com/MiniMax-AI/MiniMax-M2.7) | ModelScope (https://www.modelscope.cn/organization/MiniMax) | LICENSE (https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE)

MiniMax-M2.7 is the first of our models to participate deeply in its own evolution. M2.7 can build complex agent harnesses and complete highly elaborate productivity tasks, leveraging Agent Teams, complex Skills, and dynamic tool search. For more details, see our blog post.

<p align="center"> <img width="100%" src="figures/benchmark_overview.png"> </p>

Model Self-Evolution

M2.7 initiates a cycle of model self-evolution: during development, we let the model update its own memory, build dozens of complex skills for RL experiments, and improve its own learning process based on experiment results. An internal version of M2.7 autonomously optimized a programming scaffold over 100+ rounds β€” analyzing failure trajectories, modifying code, running evaluations, and deciding to keep or revert β€” achieving a 30% performance improvement. On MLE Bench Lite (22 ML competitions), M2.7 achieved a 66.6% medal rate, second only to Opus-4.6 and GPT-5.4.
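
The keep-or-revert optimization loop described above can be sketched as a simple hill-climbing procedure. Everything below is hypothetical (the function names and the scoring stand-in are not from the M2.7 codebase); it only illustrates the analyze / modify / evaluate / keep-or-revert cycle:

```python
import random

def evaluate(scaffold):
    """Stand-in for running an evaluation suite on a scaffold (hypothetical)."""
    return sum(scaffold.values())

def propose_edit(scaffold):
    """Stand-in for the model analyzing failure trajectories and editing code."""
    candidate = dict(scaffold)
    key = random.choice(list(candidate))
    candidate[key] += random.uniform(-1.0, 1.0)
    return candidate

def self_improve(scaffold, rounds=100):
    best = evaluate(scaffold)
    for _ in range(rounds):
        candidate = propose_edit(scaffold)   # modify the code
        score = evaluate(candidate)          # run evaluations
        if score > best:                     # keep the change...
            scaffold, best = candidate, score
        # ...otherwise revert to the previous scaffold
    return scaffold, best

final, score = self_improve({"planner": 1.0, "executor": 1.0})
```

Because a change is kept only when its score improves, the loop is monotone: the final score can never fall below the starting score.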

<p align="center"> <img width="100%" src="figures/agent_harness.png"> </p> <p align="center"> <img width="100%" src="figures/mle_bench.png"> </p>

Professional Software Engineering

M2.7 delivers outstanding real-world programming capabilities spanning log analysis, bug troubleshooting, refactoring, code security, and machine learning. Beyond code generation, M2.7 demonstrates strong system-level reasoning β€” correlating monitoring metrics, conducting trace analysis, verifying root causes in databases, and making SRE-level decisions. Using M2.7, we have reduced live production incident recovery time to under three minutes on multiple occasions.

On SWE-Pro, M2.7 achieved 56.22%, matching GPT-5.3-Codex, with even stronger performance on real-world engineering benchmarks: SWE Multilingual (76.5) and Multi SWE Bench (52.7). On VIBE-Pro (55.6%), M2.7 is nearly on par with Opus 4.6. On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), M2.7 demonstrates deep understanding of complex engineering systems. M2.7 also supports native Agent Teams for multi-agent collaboration with stable role identity and autonomous decision-making.

<p align="center"> <img width="100%" src="figures/agent_teams.gif"> </p>

Professional Work

M2.7 achieved an ELO score of 1495 on GDPval-AA (highest among open-source models), surpassing GPT-5.3. It handles Word, Excel, and PPT with high-fidelity multi-round editing, producing editable deliverables. On Toolathon, M2.7 reached 46.3% accuracy (global top tier), and maintains 97% skill compliance across 40+ complex skills on MM Claw. On the MM Claw end-to-end benchmark, M2.7 achieved 62.7%, close to Sonnet 4.6.

Entertainment

M2.7 features strengthened character consistency and emotional intelligence. We open-sourced OpenRoom, an interactive demo that embeds AI interaction in a Web GUI with real-time visual feedback and scene interactions. Try it at openroom.ai.

How to Use

  • MiniMax Agent: https://agent.minimax.io/
  • MiniMax API: https://platform.minimax.io/
  • Token Plan: https://platform.minimax.io/subscribe/token-plan

Local Deployment Guide

Download the model from HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2.7

We recommend using the following inference frameworks to serve the model:

SGLang

We recommend using SGLang to serve MiniMax-M2.7. Please refer to our SGLang Deployment Guide.

vLLM

We recommend using vLLM to serve MiniMax-M2.7. Please refer to our vLLM Deployment Guide.

Transformers

We recommend using Transformers to serve MiniMax-M2.7. Please refer to our Transformers Deployment Guide.

ModelScope

You can also download the model weights from ModelScope.

NVIDIA NIM

MiniMax M2.7 is also available on NVIDIA NIM Endpoint.

Inference Parameters

We recommend using the following parameters for best performance: `temperature=1.0`, `top_p=0.95`, `top_k=40`. Default system prompt:

You are a helpful assistant. Your name is MiniMax-M2.7 and is built by MiniMax.
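
As a sketch, these parameters slot into an OpenAI-style chat-completion request body. The model name and message contents below are illustrative, and whether `top_k` is accepted as a top-level field depends on the serving stack; check the MiniMax API docs for the exact schema:

```python
# Chat-completion request body using the recommended sampling parameters.
# Field values other than the sampling parameters are illustrative.
payload = {
    "model": "MiniMax-M2.7",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Your name is MiniMax-M2.7 and is built by MiniMax.",
        },
        {"role": "user", "content": "Explain what an agent harness is."},
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
}
```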

Tool Calling Guide

Please refer to our Tool Calling Guide.

Contact Us

Contact us at model@minimax.io.

Author: cyankiwi

Likes: 10

Downloads: 58

Tags: transformers, safetensors, minimax_m2, text-generation, conversational, custom_code, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, license:other, endpoints_compatible, compressed-tensors, region:us

obsxrver/wan2.2-i2v-lightx2v-260412

Author: obsxrver

Likes: 8

Downloads: 0

Tags: region:us

OpenMOSS-Team/MOSS-Audio-8B-Thinking


license: apache-2.0
language:
  • en
  • zh
tags:
  • audio
  • speech
  • music
  • understanding
  • multimodal
  • reasoning
  • chain-of-thought
pipeline_tag: text-generation

MOSS-Audio

<p align="center"> <img src="./assets/moss-audio-logo.png" width="55%" /> </p> <div align="center"> <a href="https://huggingface.co/collections/OpenMOSS-Team/moss-audio"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&amp"></a> <img src="https://img.shields.io/badge/Blog-Coming_Soon-blue?logo=internet-explorer&amp"> <img src="https://img.shields.io/badge/Arxiv-Coming_Soon-red?logo=Arxiv&amp">

<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a> <a href="https://discord.gg/Xf3aXddCjc"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a> <a href="./assets/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&amp;logoColor=white" alt="WeChat"></a>

</div>

MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

<p align="center"> <img src="./assets/moss-audio-image.png" width="95%" /> </p>

Understanding audio requires more than simply transcribing words β€” it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

<p align="center"> <img src="./assets/arc.png" width="95%" /> </p>

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
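
The encoder → adapter → LLM pipeline above can be sketched numerically. The dimensions here are illustrative, not the real config; only the 12.5 Hz frame rate comes from the model card:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_RATE_HZ = 12.5           # encoder output rate stated above
ENC_DIM, LLM_DIM = 512, 2048   # illustrative sizes, not the real config

def encode(audio_seconds):
    """Stand-in for MOSS-Audio-Encoder: one ENC_DIM vector per 80 ms frame."""
    n_frames = int(audio_seconds * FRAME_RATE_HZ)
    return rng.standard_normal((n_frames, ENC_DIM))

def adapt(features, w):
    """Modality adapter: project encoder features into the LLM embedding space."""
    return features @ w

w_adapter = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02
audio_embeds = adapt(encode(10.0), w_adapter)   # 10 s of audio -> 125 frames
# audio_embeds would then be interleaved with text embeddings and consumed
# by the LLM for auto-regressive text generation.
```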

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure β€” information that a single high-level representation cannot fully capture.
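
A minimal sketch of the cross-layer injection idea, under assumed shapes (layer count, tap indices, and dimensions are all hypothetical; the real module's wiring is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
T, ENC_DIM, LLM_DIM = 125, 512, 2048   # illustrative sizes
TAP_LAYERS = [4, 12, 23]               # which encoder layers to tap (hypothetical)

# Hypothetical per-layer encoder outputs: 24 layers, each (T, ENC_DIM)
encoder_layers = [rng.standard_normal((T, ENC_DIM)) for _ in range(24)]

# One independent projection per tapped layer
proj = {l: rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02 for l in TAP_LAYERS}

# Inject each tapped layer's projected features into an early LLM layer's
# hidden states (residual-style addition), alongside the final-layer stream.
llm_hidden = rng.standard_normal((T, LLM_DIM))   # stand-in early-layer states
for l in TAP_LAYERS:
    llm_hidden = llm_hidden + encoder_layers[l] @ proj[l]
```

The key property is that each tapped layer gets its own projection, so low-level and high-level features reach the LLM without being collapsed into a single representation.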

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
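
The time-marker strategy can be sketched as interleaving explicit time tokens into the frame sequence. The marker interval and token format below are illustrative (the model card does not state the real values); only the 12.5 Hz frame rate is from the text:

```python
FRAME_RATE_HZ = 12.5   # encoder frame rate from the model card
MARKER_EVERY_S = 2.0   # marker interval is illustrative, not the real value

def insert_time_markers(n_frames):
    """Interleave explicit time tokens between audio-frame positions."""
    step = int(FRAME_RATE_HZ * MARKER_EVERY_S)   # 25 frames per marker here
    seq = []
    for i in range(n_frames):
        if i % step == 0:
            seq.append(f"<|t={i / FRAME_RATE_HZ:.1f}s|>")  # hypothetical token
        seq.append(f"<frame_{i}>")
    return seq

seq = insert_time_markers(50)   # 4 s of audio -> markers at 0.0 s and 2.0 s
```

Because the markers live in the same token stream as the frames, "what happened when" reduces to ordinary next-token prediction over a sequence that already encodes time.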

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face |
|---|---|---|---:|---|
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 70.80, outperforming all open-source models.
  • Speech Captioning: The MOSS-Audio Instruct variants lead on 11 of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech, dramatically outperforming Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.
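
The ASR numbers reported here are character error rates (lower is better). As a reminder of the metric, CER is edit distance over reference length; a minimal implementation:

```python
def levenshtein(a, b):
    """Edit distance between two sequences via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

Multiply by 100 to express the rate as a percentage, matching the tables below.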

General Audio Understanding (Accuracy↑)

<p align="center"> <img src="./assets/general_audio_bar.svg" width="75%" /> </p> <table> <thead> <tr> <th>Model</th> <th>Model Size</th> <th>MMAU</th> <th>MMAU-Pro</th> <th>MMAR</th> <th>MMSU</th> <th>Avg</th> </tr> </thead> <tbody> <tr><td colspan="7"><em><strong>Open Source (small)</strong></em></td></tr> <tr> <td>Kimi-Audio</td><td>7B</td><td>72.41</td><td>56.58</td><td>60.82</td><td>54.74</td><td>61.14</td> </tr> <tr> <td>Qwen2.5-Omni</td><td>7B</td><td>65.60</td><td>52.20</td><td>56.70</td><td>61.32</td><td>58.96</td> </tr> <tr> <td>Audio Flamingo 3</td><td>7B</td><td>61.23</td><td>51.70</td><td>57.96</td><td>60.04</td><td>57.73</td> </tr> <tr> <td>MiMo-Audio-7B</td><td>7B</td><td>74.90</td><td>53.35</td><td>61.70</td><td>61.94</td><td>62.97</td> </tr> <tr> <td>MiniCPM-o-4.5</td><td>9B</td><td>70.97</td><td>39.65</td><td>55.75</td><td>60.96</td><td>56.83</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Instruct</strong></td><td><strong>4B</strong></td><td>75.79</td><td>58.16</td><td>59.68</td><td>59.68</td><td>64.04</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Thinking</strong></td><td><strong>4B</strong></td><td><strong>77.64</strong></td><td>60.75</td><td>63.91</td><td>71.20</td><td>68.37</td> </tr> <tr> <td><strong>MOSS-Audio-8B-Instruct</strong></td><td><strong>8B</strong></td><td>77.03</td><td>57.48</td><td>64.42</td><td>66.36</td><td>66.32</td> </tr> <tr> <td><strong>MOSS-Audio-8B-Thinking</strong></td><td><strong>8B</strong></td><td>77.13</td><td><strong>64.29</strong></td><td><strong>65.73</strong></td><td><strong>76.06</strong></td><td><strong>70.80</strong></td> </tr> <tr><td colspan="7"><em><strong>Open Source (large)</strong></em></td></tr> <tr> <td>Qwen3-Omni-30B-A3B-Instruct</td><td>30B</td><td>75.00</td><td><strong>61.22</strong></td><td>66.40</td><td>69.00</td><td>67.91</td> </tr> <tr> <td>Step-Audio-R1.1</td><td>33B</td><td>72.18</td><td>60.80</td><td>68.75</td><td>64.18</td><td>66.48</td> </tr> <tr> 
<td>Step-Audio-R1</td><td>33B</td><td><strong>78.67</strong></td><td>59.68</td><td><strong>69.15</strong></td><td><strong>75.18</strong></td><td><strong>70.67</strong></td> </tr> <tr><td colspan="7"><em><strong>Closed Source</strong></em></td></tr> <tr> <td>GPT4o-Audio</td><td>-</td><td>65.66</td><td>52.30</td><td>59.78</td><td>58.76</td><td>59.13</td> </tr> <tr> <td>Gemini-3-Pro</td><td>-</td><td>80.15</td><td>68.28</td><td>81.73</td><td>81.28</td><td>77.86</td> </tr> <tr> <td>Gemini-3.1-Pro</td><td>-</td><td><strong>81.10</strong></td><td><strong>73.47</strong></td><td><strong>83.70</strong></td><td><strong>81.30</strong></td><td><strong>79.89</strong></td> </tr> </tbody> </table>
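
As a sanity check, the Avg column above is the plain arithmetic mean of the four benchmark scores; for MOSS-Audio-8B-Thinking:

```python
# Avg = mean of the four benchmark accuracies for MOSS-Audio-8B-Thinking.
scores = {"MMAU": 77.13, "MMAU-Pro": 64.29, "MMAR": 65.73, "MMSU": 76.06}
avg = sum(scores.values()) / len(scores)   # 70.8025, reported as 70.80
```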

Speech Captioning (LLM-as-a-Judge Score↑)

<p align="center"> <img src="./assets/speech_caption_radar.png" width="70%" /> </p> <details> <summary><strong>Speech Captioning (click to expand)</strong></summary>

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

</details>

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |

<details> <summary><strong>Detailed ASR Results (click to expand)</strong></summary> <table> <tr> <th rowspan="2">Model</th> <th colspan="3">Acoustic Environment (Clean)</th> <th colspan="1">Acoustic Environment (Noisy)</th> <th colspan="1">Acoustic Characteristics: Whisper</th> <th colspan="1">Acoustic Characteristics: Far-Field / Near-Field</th> <th colspan="1">Multi-Speaker</th> <th colspan="2">Age</th> <th colspan="2">Health Condition</th> <th colspan="2">Semantic Content</th> <th colspan="3">Code-Switching</th> <th colspan="2">Dialect</th> <th colspan="2">Singing</th> <th colspan="1">Non-Speech Vocalizations</th> </tr> <tr> <th>AISHELL-1<br><em>test</em></th> <th>AISHELL-2<br><em>Android | IOS | Mic</em></th> <th>THCHS-30<br><em>test</em></th> <th>MAGICDATA-READ<br><em>test</em></th> <th>AISHELL6-Whisper<br><em>normal | whisper</em></th> <th>AliMeeting<br><em>Test_Ali_far | Test_Ali_near</em></th> <th>AISHELL-4<br><em>test</em></th> <th>SeniorTalk<br><em>sentence</em></th> <th>ChildMandarin<br><em>test</em></th> <th>AISHELL-6A<br><em>mild | moderate | severe | StutteringSpeech</em></th> <th>AISHELL_6B<br><em>LRDWWS | Uncontrol</em></th> <th>WenetSpeech<br><em>test-meeting</em></th> <th>Fleurs<br><em>cmn_hans_cn</em></th> <th>CS-Dialogue<br><em>test</em></th> <th>TALCS<br><em>test</em></th> <th>ASCEND<br><em>test</em></th> <th>KeSpeech<br><em>test</em></th> <th>WSYue-ASR-eval<br><em>short</em></th> <th>MIR-1K<br><em>test</em></th> <th>openc-pop<br><em>test</em></th> <th>MNV_17</th> </tr> <tr> <td>Paraformer-Large</td> <td>1.98</td> <td>3.28 | 3.21 | 3.00</td> <td>4.07</td> <td>4.67</td> <td>1.11 | 8.92</td> <td><strong>25.64</strong> | 9.27</td> <td>20.33</td> <td>17.31</td> <td>12.60</td> <td>6.98 | 9.30 | 13.34 | 10.74</td> <td>47.59 | 45.08</td> <td>7.88</td> <td>6.40</td> <td>10.64</td> <td>10.77</td> <td>16.55</td> <td>11.48</td> <td>75.42</td> <td>57.70</td> <td>6.98</td> <td>4.95</td> </tr> <tr> <td>GLM-ASR-Nano</td> <td>2.89</td> <td>3.75 | 3.73 | 
3.78</td> <td>4.23</td> <td>5.02</td> <td>0.83 | 9.06</td> <td>40.27 | 14.76</td> <td>28.02</td> <td>20.33</td> <td>14.06</td> <td>8.74 | 12.11 | 14.38 | 12.29</td> <td>50.34 | 49.09</td> <td>9.70</td> <td>4.94</td> <td>11.06</td> <td>11.07</td> <td>13.50</td> <td>9.72</td> <td>35.07</td> <td>95.87</td> <td>8.03</td> <td>4.65</td> </tr> <tr> <td>Fun-ASR-Nano</td> <td>2.16</td> <td>3.04 | 2.99 | 3.07</td> <td>3.65</td> <td>3.46</td> <td>0.81 | 6.76</td> <td>27.21 | 9.55</td> <td>19.82</td> <td>16.96</td> <td>12.94</td> <td>6.60 | <strong>8.81</strong> | 12.98 | 10.30</td> <td>47.42 | 45.84</td> <td>7.39</td> <td><strong>4.76</strong></td> <td>10.47</td> <td><strong>8.09</strong></td> <td>15.13</td> <td>7.43</td> <td>8.17</td> <td>35.85</td> <td>2.84</td> <td>4.76</td> </tr> <tr> <td>SenseVoice-Small</td> <td>3.23</td> <td>4.16 | 4.02 | 3.96</td> <td>5.26</td> <td>4.93</td> <td>1.25 | 9.88</td> <td>37.01 | 16.31</td> <td>24.06</td> <td>21.07</td> <td>14.18</td> <td>7.62 | 9.85 | 14.39 | 11.47</td> <td>52.92 | 47.97</td> <td>8.35</td> <td>6.75</td> <td>12.81</td> <td>10.52</td> <td>18.38</td> <td>10.45</td> <td><strong>7.34</strong></td> <td>39.51</td> <td>8.07</td> <td>4.92</td> </tr> <tr> <td>Kimi-Audio-7B-Instruct</td> <td><strong>0.79</strong></td> <td>2.91 | 3.03 | 2.88</td> <td><strong>1.39</strong></td> <td><strong>2.15</strong></td> <td>0.69 | 4.63</td> <td>28.22 | 13.82</td> <td>20.61</td> <td>19.70</td> <td>13.79</td> <td>7.00 | 9.34 | 12.56 | 10.75</td> <td>44.44 | 42.57</td> <td>7.15</td> <td>5.10</td> <td>14.56</td> <td>12.74</td> <td>21.83</td> <td><strong>5.51</strong></td> <td>53.17</td> <td>38.35</td> <td>5.17</td> <td>4.68</td> </tr> <tr> <td>Qwen2.5-Omni-3B</td> <td>1.51</td> <td>3.10 | 2.94 | 2.93</td> <td>3.32</td> <td>3.56</td> <td>0.82 | 7.82</td> <td>32.14 | 12.16</td> <td>22.91</td> <td>17.38</td> <td>12.96</td> <td>6.87 | 10.55 | 14.57 | 11.33</td> <td>54.54 | 50.03</td> <td>9.04</td> <td>5.45</td> <td>10.78</td> <td>10.94</td> <td>13.25</td> 
<td>7.67</td> <td>60.06</td> <td>45.00</td> <td>3.47</td> <td>5.54</td> </tr> <tr> <td>Qwen2.5-Omni-7B</td> <td>1.16</td> <td>2.88 | 2.77 | 2.73</td> <td>3.06</td> <td>3.16</td> <td>0.71 | 6.57</td> <td>32.03 | 18.73</td> <td>21.01</td> <td>19.96</td> <td>12.29</td> <td>7.27 | 10.94 | 12.92 | 10.53</td> <td>51.99 | 49.45</td> <td>8.43</td> <td>5.13</td> <td>14.02</td> <td>10.46</td> <td>14.42</td> <td>6.40</td> <td>57.43</td> <td>42.62</td> <td>2.75</td> <td>4.56</td> </tr> <tr> <td>Qwen3-Omni-30B-A3B-Instruct</td> <td>0.95</td> <td><strong>2.70</strong> | <strong>2.72</strong> | <strong>2.57</strong></td> <td>2.21</td> <td>2.47</td> <td><strong>0.59</strong> | <strong>3.22</strong></td> <td>25.72 | <strong>8.44</strong></td> <td><strong>18.15</strong></td> <td><strong>14.13</strong></td> <td><strong>8.79</strong></td> <td>6.20 | 8.88 | 11.59 | 10.25</td> <td>45.80 | 41.65</td> <td><strong>6.64</strong></td> <td>4.84</td> <td>12.94</td> <td>8.33</td> <td><strong>12.64</strong></td> <td>5.87</td> <td>25.39</td> <td>30.81</td> <td><strong>1.21</strong></td> <td>4.73</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Instruct</strong></td> <td>2.26</td> <td>3.22 | 3.20 | 3.33</td> <td>3.53</td> <td>3.72</td> <td>0.73 | 5.86</td> <td>27.27 | 9.68</td> <td>20.33</td> <td>16.93</td> <td>13.25</td> <td>6.36 | 9.77 | 12.68 | 10.28</td> <td>43.35 | 44.25</td> <td>8.17</td> <td>8.13</td> <td>9.14</td> <td>8.37</td> <td>12.83</td> <td>14.65</td> <td>9.04</td> <td>18.47</td> <td>3.10</td> <td><strong>4.01</strong></td> </tr> <tr> <td><strong>MOSS-Audio-8B-Instruct</strong></td> <td>1.82</td> <td>2.97 | 2.95 | 2.91</td> <td>2.82</td> <td>3.20</td> <td>0.69 | 4.80</td> <td>36.82 | 11.25</td> <td>24.36</td> <td>17.42</td> <td>13.10</td> <td><strong>5.84</strong> | 8.94 | <strong>11.52</strong> | <strong>9.72</strong></td> <td><strong>39.76</strong> | <strong>39.27</strong></td> <td>7.86</td> <td>7.52</td> <td><strong>9.07</strong></td> <td>8.22</td> <td>13.26</td> <td>9.18</td> 
<td>8.33</td> <td><strong>17.24</strong></td> <td>2.39</td> <td>4.31</td> </tr> </table> </details>

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
|---|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 with a clean Conda environment. The commands below are enough for local inference.

Recommended setup

```shell
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

```shell
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```

Basic Usage

Download the model first:

```shell
huggingface-cli download OpenMOSS-Team/MOSS-Audio --local-dir ./weights/MOSS-Audio
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Instruct --local-dir ./weights/MOSS-Audio-Instruct
```

Then edit `MODEL_PATH` / `AUDIO_PATH` in `infer.py` as needed, and run:

```shell
python infer.py
```

The default prompt in `infer.py` is `Describe this audio.` You can edit that line directly to try transcription, audio QA, or speech captioning.

Gradio App

Start the Gradio demo with:

```shell
python app.py
```

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

```shell
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

If you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.

<a id="more-information"></a>

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

```bibtex
@misc{mossaudio2026,
  title={MOSS-Audio Technical Report},
  author={OpenMOSS Team},
  year={2026},
  howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
  note={GitHub repository}
}
```

Author: OpenMOSS-Team

Likes: 6

Downloads: 0

Tags: safetensors, moss_audio, audio, speech, music, understanding, multimodal, reasoning, chain-of-thought, text-generation, conversational, custom_code, en, zh, license:apache-2.0, region:us

OpenMOSS-Team/MOSS-Audio-8B-Instruct


license: apache-2.0
language:
  • en
  • zh
tags:
  • audio
  • speech
  • music
  • understanding
  • multimodal
  • instruct
pipeline_tag: text-generation

MOSS-Audio

<p align="center"> <img src="./assets/moss-audio-logo.png" width="55%" /> </p> <div align="center"> <a href="https://huggingface.co/collections/OpenMOSS-Team/moss-audio"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&amp"></a> <img src="https://img.shields.io/badge/Blog-Coming_Soon-blue?logo=internet-explorer&amp"> <img src="https://img.shields.io/badge/Arxiv-Coming_Soon-red?logo=Arxiv&amp">

<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a> <a href="https://discord.gg/Xf3aXddCjc"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a> <a href="./assets/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&amp;logoColor=white" alt="WeChat"></a>

</div>

MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

<p align="center"> <img src="./assets/moss-audio-image.png" width="95%" /> </p>

Understanding audio requires more than simply transcribing words β€” it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

<p align="center"> <img src="./assets/arc.png" width="95%" /> </p>

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure β€” information that a single high-level representation cannot fully capture.

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
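A minimal sketch of the insertion strategy, assuming a hypothetical marker format and a 2-second interval (the actual token format and interval used in pretraining are not specified here):

```python
def insert_time_markers(n_frames, frame_rate=12.5, interval=2.0):
    """Interleave explicit time tokens between audio frame slots at fixed intervals."""
    frames_per_marker = int(frame_rate * interval)  # 25 frames per 2 s at 12.5 Hz
    seq = []
    for i in range(n_frames):
        if i % frames_per_marker == 0:
            # Hypothetical marker token indicating the absolute time position.
            seq.append(f"<t={i / frame_rate:.1f}s>")
        seq.append(f"frame_{i}")
    return seq

seq = insert_time_markers(75)  # 6 s of audio -> markers at 0 s, 2 s, 4 s
print(seq[0], seq[26])         # <t=0.0s> <t=2.0s>
```

With markers like these in the input stream, "what happened when" becomes ordinary next-token prediction, which is what makes timestamp ASR and event localization fit a unified text-generation framework.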

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face |
|---|---|---|---:|---|
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 70.80, outperforming all evaluated open-source models.
  • Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech (lower is better), dramatically outperforming Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.

General Audio Understanding (Accuracy↑)

<p align="center"> <img src="./assets/general_audio_bar.svg" width="75%" /> </p> <table> <thead> <tr> <th>Model</th> <th>Model Size</th> <th>MMAU</th> <th>MMAU-Pro</th> <th>MMAR</th> <th>MMSU</th> <th>Avg</th> </tr> </thead> <tbody> <tr><td colspan="7"><em><strong>Open Source (small)</strong></em></td></tr> <tr> <td>Kimi-Audio</td><td>7B</td><td>72.41</td><td>56.58</td><td>60.82</td><td>54.74</td><td>61.14</td> </tr> <tr> <td>Qwen2.5-Omni</td><td>7B</td><td>65.60</td><td>52.20</td><td>56.70</td><td>61.32</td><td>58.96</td> </tr> <tr> <td>Audio Flamingo 3</td><td>7B</td><td>61.23</td><td>51.70</td><td>57.96</td><td>60.04</td><td>57.73</td> </tr> <tr> <td>MiMo-Audio-7B</td><td>7B</td><td>74.90</td><td>53.35</td><td>61.70</td><td>61.94</td><td>62.97</td> </tr> <tr> <td>MiniCPM-o-4.5</td><td>9B</td><td>70.97</td><td>39.65</td><td>55.75</td><td>60.96</td><td>56.83</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Instruct</strong></td><td><strong>4B</strong></td><td>75.79</td><td>58.16</td><td>59.68</td><td>59.68</td><td>64.04</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Thinking</strong></td><td><strong>4B</strong></td><td><strong>77.64</strong></td><td>60.75</td><td>63.91</td><td>71.20</td><td>68.37</td> </tr> <tr> <td><strong>MOSS-Audio-8B-Instruct</strong></td><td><strong>8B</strong></td><td>77.03</td><td>57.48</td><td>64.42</td><td>66.36</td><td>66.32</td> </tr> <tr> <td><strong>MOSS-Audio-8B-Thinking</strong></td><td><strong>8B</strong></td><td>77.13</td><td><strong>64.29</strong></td><td><strong>65.73</strong></td><td><strong>76.06</strong></td><td><strong>70.80</strong></td> </tr> <tr><td colspan="7"><em><strong>Open Source (large)</strong></em></td></tr> <tr> <td>Qwen3-Omni-30B-A3B-Instruct</td><td>30B</td><td>75.00</td><td><strong>61.22</strong></td><td>66.40</td><td>69.00</td><td>67.91</td> </tr> <tr> <td>Step-Audio-R1.1</td><td>33B</td><td>72.18</td><td>60.80</td><td>68.75</td><td>64.18</td><td>66.48</td> </tr> <tr> 
<td>Step-Audio-R1</td><td>33B</td><td><strong>78.67</strong></td><td>59.68</td><td><strong>69.15</strong></td><td><strong>75.18</strong></td><td><strong>70.67</strong></td> </tr> <tr><td colspan="7"><em><strong>Closed Source</strong></em></td></tr> <tr> <td>GPT4o-Audio</td><td>-</td><td>65.66</td><td>52.30</td><td>59.78</td><td>58.76</td><td>59.13</td> </tr> <tr> <td>Gemini-3-Pro</td><td>-</td><td>80.15</td><td>68.28</td><td>81.73</td><td>81.28</td><td>77.86</td> </tr> <tr> <td>Gemini-3.1-Pro</td><td>-</td><td><strong>81.10</strong></td><td><strong>73.47</strong></td><td><strong>83.70</strong></td><td><strong>81.30</strong></td><td><strong>79.89</strong></td> </tr> </tbody> </table>

Speech Captioning (LLM-as-a-Judge Score↑)

<p align="center"> <img src="./assets/speech_caption_radar.png" width="70%" /> </p> <details> <summary><strong>Speech Captioning (click to expand)</strong></summary>

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

</details>

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |

<details> <summary><strong>Detailed ASR Results (click to expand)</strong></summary> <table> <tr> <th rowspan="2">Model</th> <th colspan="3">Acoustic Environment (Clean)</th> <th colspan="1">Acoustic Environment (Noisy)</th> <th colspan="1">Acoustic Characteristics: Whisper</th> <th colspan="1">Acoustic Characteristics: Far-Field / Near-Field</th> <th colspan="1">Multi-Speaker</th> <th colspan="2">Age</th> <th colspan="2">Health Condition</th> <th colspan="2">Semantic Content</th> <th colspan="3">Code-Switching</th> <th colspan="2">Dialect</th> <th colspan="2">Singing</th> <th colspan="1">Non-Speech Vocalizations</th> </tr> <tr> <th>AISHELL-1<br><em>test</em></th> <th>AISHELL-2<br><em>Android | IOS | Mic</em></th> <th>THCHS-30<br><em>test</em></th> <th>MAGICDATA-READ<br><em>test</em></th> <th>AISHELL6-Whisper<br><em>normal | whisper</em></th> <th>AliMeeting<br><em>Test_Ali_far | Test_Ali_near</em></th> <th>AISHELL-4<br><em>test</em></th> <th>SeniorTalk<br><em>sentence</em></th> <th>ChildMandarin<br><em>test</em></th> <th>AISHELL-6A<br><em>mild | moderate | severe | StutteringSpeech</em></th> <th>AISHELL_6B<br><em>LRDWWS | Uncontrol</em></th> <th>WenetSpeech<br><em>test-meeting</em></th> <th>Fleurs<br><em>cmn_hans_cn</em></th> <th>CS-Dialogue<br><em>test</em></th> <th>TALCS<br><em>test</em></th> <th>ASCEND<br><em>test</em></th> <th>KeSpeech<br><em>test</em></th> <th>WSYue-ASR-eval<br><em>short</em></th> <th>MIR-1K<br><em>test</em></th> <th>openc-pop<br><em>test</em></th> <th>MNV_17</th> </tr> <tr> <td>Paraformer-Large</td> <td>1.98</td> <td>3.28 | 3.21 | 3.00</td> <td>4.07</td> <td>4.67</td> <td>1.11 | 8.92</td> <td><strong>25.64</strong> | 9.27</td> <td>20.33</td> <td>17.31</td> <td>12.60</td> <td>6.98 | 9.30 | 13.34 | 10.74</td> <td>47.59 | 45.08</td> <td>7.88</td> <td>6.40</td> <td>10.64</td> <td>10.77</td> <td>16.55</td> <td>11.48</td> <td>75.42</td> <td>57.70</td> <td>6.98</td> <td>4.95</td> </tr> <tr> <td>GLM-ASR-Nano</td> <td>2.89</td> <td>3.75 | 3.73 | 
3.78</td> <td>4.23</td> <td>5.02</td> <td>0.83 | 9.06</td> <td>40.27 | 14.76</td> <td>28.02</td> <td>20.33</td> <td>14.06</td> <td>8.74 | 12.11 | 14.38 | 12.29</td> <td>50.34 | 49.09</td> <td>9.70</td> <td>4.94</td> <td>11.06</td> <td>11.07</td> <td>13.50</td> <td>9.72</td> <td>35.07</td> <td>95.87</td> <td>8.03</td> <td>4.65</td> </tr> <tr> <td>Fun-ASR-Nano</td> <td>2.16</td> <td>3.04 | 2.99 | 3.07</td> <td>3.65</td> <td>3.46</td> <td>0.81 | 6.76</td> <td>27.21 | 9.55</td> <td>19.82</td> <td>16.96</td> <td>12.94</td> <td>6.60 | <strong>8.81</strong> | 12.98 | 10.30</td> <td>47.42 | 45.84</td> <td>7.39</td> <td><strong>4.76</strong></td> <td>10.47</td> <td><strong>8.09</strong></td> <td>15.13</td> <td>7.43</td> <td>8.17</td> <td>35.85</td> <td>2.84</td> <td>4.76</td> </tr> <tr> <td>SenseVoice-Small</td> <td>3.23</td> <td>4.16 | 4.02 | 3.96</td> <td>5.26</td> <td>4.93</td> <td>1.25 | 9.88</td> <td>37.01 | 16.31</td> <td>24.06</td> <td>21.07</td> <td>14.18</td> <td>7.62 | 9.85 | 14.39 | 11.47</td> <td>52.92 | 47.97</td> <td>8.35</td> <td>6.75</td> <td>12.81</td> <td>10.52</td> <td>18.38</td> <td>10.45</td> <td><strong>7.34</strong></td> <td>39.51</td> <td>8.07</td> <td>4.92</td> </tr> <tr> <td>Kimi-Audio-7B-Instruct</td> <td><strong>0.79</strong></td> <td>2.91 | 3.03 | 2.88</td> <td><strong>1.39</strong></td> <td><strong>2.15</strong></td> <td>0.69 | 4.63</td> <td>28.22 | 13.82</td> <td>20.61</td> <td>19.70</td> <td>13.79</td> <td>7.00 | 9.34 | 12.56 | 10.75</td> <td>44.44 | 42.57</td> <td>7.15</td> <td>5.10</td> <td>14.56</td> <td>12.74</td> <td>21.83</td> <td><strong>5.51</strong></td> <td>53.17</td> <td>38.35</td> <td>5.17</td> <td>4.68</td> </tr> <tr> <td>Qwen2.5-Omni-3B</td> <td>1.51</td> <td>3.10 | 2.94 | 2.93</td> <td>3.32</td> <td>3.56</td> <td>0.82 | 7.82</td> <td>32.14 | 12.16</td> <td>22.91</td> <td>17.38</td> <td>12.96</td> <td>6.87 | 10.55 | 14.57 | 11.33</td> <td>54.54 | 50.03</td> <td>9.04</td> <td>5.45</td> <td>10.78</td> <td>10.94</td> <td>13.25</td> 
<td>7.67</td> <td>60.06</td> <td>45.00</td> <td>3.47</td> <td>5.54</td> </tr> <tr> <td>Qwen2.5-Omni-7B</td> <td>1.16</td> <td>2.88 | 2.77 | 2.73</td> <td>3.06</td> <td>3.16</td> <td>0.71 | 6.57</td> <td>32.03 | 18.73</td> <td>21.01</td> <td>19.96</td> <td>12.29</td> <td>7.27 | 10.94 | 12.92 | 10.53</td> <td>51.99 | 49.45</td> <td>8.43</td> <td>5.13</td> <td>14.02</td> <td>10.46</td> <td>14.42</td> <td>6.40</td> <td>57.43</td> <td>42.62</td> <td>2.75</td> <td>4.56</td> </tr> <tr> <td>Qwen3-Omni-30B-A3B-Instruct</td> <td>0.95</td> <td><strong>2.70</strong> | <strong>2.72</strong> | <strong>2.57</strong></td> <td>2.21</td> <td>2.47</td> <td><strong>0.59</strong> | <strong>3.22</strong></td> <td>25.72 | <strong>8.44</strong></td> <td><strong>18.15</strong></td> <td><strong>14.13</strong></td> <td><strong>8.79</strong></td> <td>6.20 | 8.88 | 11.59 | 10.25</td> <td>45.80 | 41.65</td> <td><strong>6.64</strong></td> <td>4.84</td> <td>12.94</td> <td>8.33</td> <td><strong>12.64</strong></td> <td>5.87</td> <td>25.39</td> <td>30.81</td> <td><strong>1.21</strong></td> <td>4.73</td> </tr> <tr> <td><strong>MOSS-Audio-4B-Instruct</strong></td> <td>2.26</td> <td>3.22 | 3.20 | 3.33</td> <td>3.53</td> <td>3.72</td> <td>0.73 | 5.86</td> <td>27.27 | 9.68</td> <td>20.33</td> <td>16.93</td> <td>13.25</td> <td>6.36 | 9.77 | 12.68 | 10.28</td> <td>43.35 | 44.25</td> <td>8.17</td> <td>8.13</td> <td>9.14</td> <td>8.37</td> <td>12.83</td> <td>14.65</td> <td>9.04</td> <td>18.47</td> <td>3.10</td> <td><strong>4.01</strong></td> </tr> <tr> <td><strong>MOSS-Audio-8B-Instruct</strong></td> <td>1.82</td> <td>2.97 | 2.95 | 2.91</td> <td>2.82</td> <td>3.20</td> <td>0.69 | 4.80</td> <td>36.82 | 11.25</td> <td>24.36</td> <td>17.42</td> <td>13.10</td> <td><strong>5.84</strong> | 8.94 | <strong>11.52</strong> | <strong>9.72</strong></td> <td><strong>39.76</strong> | <strong>39.27</strong></td> <td>7.86</td> <td>7.52</td> <td><strong>9.07</strong></td> <td>8.22</td> <td>13.26</td> <td>9.18</td> 
<td>8.33</td> <td><strong>17.24</strong></td> <td>2.39</td> <td>4.31</td> </tr> </table> </details>

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
|---|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 with a clean Conda environment. The commands below are enough for local inference.

Recommended setup

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

Basic Usage

Download the model first:

huggingface-cli download OpenMOSS-Team/MOSS-Audio --local-dir ./weights/MOSS-Audio
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Instruct --local-dir ./weights/MOSS-Audio-Instruct

Then edit MODEL_PATH / AUDIO_PATH in infer.py as needed, and run:

python infer.py

The default prompt in infer.py is Describe this audio. You can directly edit that line if you want to try transcription, audio QA, or speech captioning.

Gradio App

Start the Gradio demo with:

python app.py

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code

If you use the default torch==2.9.1+cu128 runtime, installing nvidia-cudnn-cu12==9.16.0.29 is recommended before starting sglang serve.

<a id="more-information"></a>

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

@misc{mossaudio2026,
      title={MOSS-Audio Technical Report},
      author={OpenMOSS Team},
      year={2026},
      howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
      note={GitHub repository}
}

Author: OpenMOSS-Team

Likes: 6

Downloads: 0

Tags: safetensors, moss_audio, audio, speech, music, understanding, multimodal, instruct, text-generation, conversational, custom_code, en, zh, license:apache-2.0, region:us

huihui-ai/Huihui4-48B-A4B-abliterated


library_name: transformers license: apache-2.0 license_link: https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated/blob/main/LICENSE pipeline_tag: image-text-to-text base_model:

  • huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
  • TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill tags:
  • abliterated
  • uncensored
  • Moe

huihui-ai/Huihui4-48B-A4B-abliterated

Model Overview

huihui-ai/Huihui4-48B-A4B-abliterated is a Mixture of Experts (MoE) language model developed by huihui.ai, built upon the huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated base model. It enhances the standard Transformer architecture by replacing MLP layers with MoE layers, each containing 256 experts, to achieve high performance with efficient inference. The model is designed for natural language processing tasks, including image-text-to-text generation, question answering, and conversational applications.

This is an experiment: merging different manifestations of the same model family is another possibility worth exploring.

Note: All knowledge acquired from pre-training and fine-tuning remains completely intact in the 256 expert modules. We only removed the safety gatekeeper (attention routing and refusal mechanisms) that controls whether the model is allowed to output that knowledge.

  • Architecture: Gemma4ForConditionalGeneration model with 256 experts per layer, activating 8 experts per token.
  • Total Parameters: ~48 billion (48B)
  • Activated Parameters: ~4 billion (4B) during inference, comparable to google/gemma-4-26B-A4B-it
  • Developer: huihui.ai
  • Release Date: March 2026
  • License: Inherits the license of the gemma-4-26B-A4B-it base model (apache-2.0)

ollama

Please use the latest version of Ollama.

You can use huihui_ai/gemma-4-abliterated:48b directly:

ollama run huihui_ai/gemma-4-abliterated:48b

Expert Models:

Expert 1-128:

huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated

Expert 129-256:

TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill

Instruction Following:

huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated

Training

  • Base Model: huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
  • Conversion: The model copies embeddings, self-attention, and normalization weights from huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated, replacing MLP layers with MoE layers (256 experts).
  • Fine-Tuning: Not fine-tuned; users are recommended to fine-tune for specific tasks to optimize expert routing.
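The conversion step above can be sketched as follows, with a toy dense MLP standing in for the Gemma-style blocks. The function name, the untrained router, and all dimensions are illustrative assumptions, not the repository's actual conversion script:

```python
import copy
import torch.nn as nn

def mlp_to_moe(mlp: nn.Module, hidden_dim: int, num_experts: int = 4) -> nn.ModuleDict:
    """Duplicate a dense MLP into `num_experts` experts and add a gating router.

    Each expert starts as an exact copy, so all knowledge in the original
    weights is preserved; only the (untrained) router decides which experts fire.
    """
    experts = nn.ModuleList(copy.deepcopy(mlp) for _ in range(num_experts))
    router = nn.Linear(hidden_dim, num_experts, bias=False)  # untrained gate
    return nn.ModuleDict({"router": router, "experts": experts})

# Toy dense MLP in place of a real transformer block's feed-forward layer.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = mlp_to_moe(dense, hidden_dim=64, num_experts=4)
print(len(moe["experts"]))
```

Because the router is untrained, a conversion like this would route tokens essentially at random until fine-tuning, which is consistent with the card's recommendation to fine-tune for specific tasks to optimize expert routing.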

Applications

  • image-text-to-text Generation: Articles, dialogues, and creative writing.
  • Question Answering: Information retrieval and query resolution.
  • Conversational AI: Multi-turn dialogues for chatbots.
  • Research: Exploration of MoE architectures and efficient model scaling.

Limitations

  • Fine-Tuning Required: No weight averaging was performed for the merge; it is a simple concatenation without fine-tuning.
  • Compatibility: Developed with transformers 5.5.0; ensure matching versions to avoid loading issues.
  • Inference Speed: While efficient for an MoE model, performance depends on hardware (GPU recommended).

Ethical Considerations

  • Bias: Inherits potential biases from the gemma-4-26B-A4B-it-abliterated base model; users should evaluate outputs for fairness.
  • Usage: Intended for research and responsible applications; avoid generating harmful or misleading content.

Contact

  • Developer: huihui.ai
  • Repository: huihui-ai/Huihui4-48B-A4B-abliterated (available locally or on Hugging Face)
  • Issues: Report bugs or request features via the repository, or send an email to support@huihui.ai

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue development and improvement; even a cup of coffee helps.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 5

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, abliterated, uncensored, Moe, conversational, base_model:TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill, base_model:finetune:TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill, license:apache-2.0, endpoints_compatible, region:us

mradermacher/gemma-4-26B-A4B-it-ultra-uncensored-heretic-i1-GGUF


base_model: llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic language:

  • en library_name: transformers license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license mradermacher: readme_rev: 1 quantized_by: mradermacher tags:
  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara

About

<!-- ### quantize_version: 2 --> <!-- ### output_tensor_quantised: 1 --> <!-- ### convert_type: hf --> <!-- ### vocab_type: --> <!-- ### tags: nicoboss --> <!-- ### quants: Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S --> <!-- ### quants_skip: --> <!-- ### skip_mmproj: -->

weighted/imatrix quants of https://huggingface.co/llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic

<!-- provided-files -->

For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/gemma-4-26B-A4B-it-ultra-uncensored-heretic-GGUF

This is a vision model - mmproj files (if any) will be in the static repository.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.2 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 8.4 | for the desperate |
| GGUF | i1-IQ1_M | 8.8 | mostly desperate |
| GGUF | i1-IQ2_XXS | 9.4 | |
| GGUF | i1-IQ2_XS | 9.9 | |
| GGUF | i1-IQ2_S | 10.0 | |
| GGUF | i1-IQ2_M | 10.5 | |
| GGUF | i1-Q2_K | 10.7 | IQ3_XXS probably better |
| GGUF | i1-Q2_K_S | 10.7 | very low quality |
| GGUF | i1-IQ3_XXS | 11.4 | lower quality |
| GGUF | i1-IQ3_XS | 11.7 | |
| GGUF | i1-IQ3_S | 12.3 | beats Q3_K* |
| GGUF | i1-Q3_K_S | 12.3 | IQ3_XS probably better |
| GGUF | i1-IQ3_M | 12.5 | |
| GGUF | i1-Q3_K_M | 13.4 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 13.9 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 14.0 | |
| GGUF | i1-Q4_0 | 14.6 | fast, low quality |
| GGUF | i1-Q4_K_S | 15.6 | optimal size/speed/quality |
| GGUF | i1-Q4_1 | 16.1 | |
| GGUF | i1-Q4_K_M | 16.9 | fast, recommended |
| GGUF | i1-Q5_K_S | 18.1 | |
| GGUF | i1-Q5_K_M | 19.2 | |
| GGUF | i1-Q6_K | 22.7 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.

<!-- end -->

Author: mradermacher

Likes: 4

Downloads: 0

Tags: transformers, gguf, heretic, uncensored, decensored, abliterated, ara, en, base_model:llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic, base_model:quantized:llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

mudler/MiniMax-M2.7-APEX-GGUF


license: other base_model: MiniMaxAI/MiniMax-M2.7 tags:

  • gguf
  • quantized
  • apex
  • moe
  • mixture-of-experts
  • minimax

MiniMax-M2.7 APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of MiniMax-M2.7.

Brought to you by the LocalAI team | APEX Project | Technical Report

Note: MiniMax M2 architecture support in llama.cpp is still maturing. If you encounter inference issues, ensure you're using a recent llama.cpp build (b8766+) and report issues upstream.

Benchmark Results

Benchmarks coming soon. For reference APEX benchmarks on the Gemma 4 26B-A4B architecture, see mudler/gemma-4-26B-A4B-it-APEX-GGUF.

Available Files

| File | Profile | Size | Best For |
|------|---------|------|----------|
| MiniMax-M2.7-APEX-I-Quality.gguf | I-Quality | 130 GB | Highest quality with imatrix |
| MiniMax-M2.7-APEX-Quality.gguf | Quality | 130 GB | Highest quality standard |
| MiniMax-M2.7-APEX-I-Balanced.gguf | I-Balanced | 155 GB | Best overall quality/size ratio |
| MiniMax-M2.7-APEX-Balanced.gguf | Balanced | 155 GB | General purpose |
| MiniMax-M2.7-APEX-I-Compact.gguf | I-Compact | 100 GB | Multi-GPU setups, best quality/size |
| MiniMax-M2.7-APEX-Compact.gguf | Compact | 100 GB | Multi-GPU setups |
| MiniMax-M2.7-APEX-I-Mini.gguf | I-Mini | 81 GB | Smallest viable |

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
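A toy sketch of such a layer-wise gradient, assuming hypothetical GGUF quant-type names for the two precision tiers and the 5+5 symmetric edge width mentioned below (the assignment function itself is an illustration, not APEX's actual code):

```python
def precision_for_layer(layer: int, n_layers: int = 62, edge: int = 5) -> str:
    """Edge layers keep higher precision; middle layers compress harder."""
    if layer < edge or layer >= n_layers - edge:
        return "Q6_K"   # higher precision at the symmetric edges
    return "Q3_K"       # aggressive compression in the middle

# MiniMax-M2.7 has 62 layers; the 5 first and 5 last stay at high precision.
plan = [precision_for_layer(l) for l in range(62)]
print(plan[0], plan[30], plan[61])
```

Real APEX additionally distinguishes tensor roles (routed expert vs. shared expert vs. attention) within each layer, which this depth-only sketch omits.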

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: MiniMax-M2.7 (MiniMaxM2)
  • Layers: 62
  • Experts: 256 routed (8 active per token)
  • Total Parameters: 228.7B
  • Active Parameters: ~45B per token
  • Source Format: FP8 (float8_e4m3fn, block-quantized 128x128)
  • Intermediate Format: BF16 (dequantized during conversion)
  • APEX Config: 5+5 symmetric edge gradient across 62 layers
  • Calibration: v1.2 diverse dataset (chat, code, reasoning, multilingual, tool-calling, Wikipedia)

Run with LocalAI

local-ai run mudler/MiniMax-M2.7-APEX-GGUF@MiniMax-M2.7-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Author: mudler

Likes: 4

Downloads: 990

Tags: gguf, quantized, apex, moe, mixture-of-experts, minimax, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, license:other, endpoints_compatible, region:us, conversational

Surpem/Supertron1-4B


license: apache-2.0 language:

  • en base_model:
  • Qwen/Qwen3-4B pipeline_tag: text-generation library_name: transformers tags:
  • reasoning
  • math
  • coding
  • instruction-tuned
  • pytorch

Supertron1-4B: A Capable, Efficient Instruction-Tuned Language Model

Model Description

Supertron1-4B is an instruction-tuned language model built on top of Qwen3-4B. Designed to be a reliable, efficient daily driver, it delivers strong performance across math, coding, reasoning, and general conversation while remaining fast and lightweight enough to run on consumer hardware.

  • Developed by: Surpem
  • Model type: Causal Language Model
  • Architecture: Dense Transformer, 4B parameters
  • Fine-tuned from: Qwen/Qwen3-4B
  • License: Apache 2.0

Results

Supertron1-4B holds its own against models in the 4–8B class and surpasses Mistral 7B on all four core benchmarks despite having nearly half the parameters.

<div align="center"> <p align="center"><a href="https://postimg.cc/zLfkBN3D"><img width=800 src="https://i.postimg.cc/0NYXV2sS/1000037258.png"/></a></p> </div>

Key takeaways:

  • Beats Mistral 7B on every benchmark at 4B parameters
  • Strong GSM8K and HumanEval performance from math and coding focused tuning
  • Competitive with Phi-4 mini on a fraction of the compute

Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "surpem/supertron1-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain the difference between LoRA and full fine-tuning."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Citation

@misc{surpem2026supertron1,
      title={Supertron1-4B β€” Efficient Instruction-Tuned Language Model},
      author={Surpem},
      year={2026},
      url={https://huggingface.co/surpem/supertron1-4b},
}

Author: Surpem

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, reasoning, math, coding, instruction-tuned, pytorch, conversational, en, base_model:Qwen/Qwen3-4B, base_model:finetune:Qwen/Qwen3-4B, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

DaNS2025/AnimaYume.GGUF


license: fair-noncommercial-research-license base_model:

  • duongve/AnimaYume

Quantized in GGUF format using SD.cpp.

Send me a tip if this quantization helped you: https://ko-fi.com/xdnss

Sample

AnimaYume is a text-to-image model fine-tuned from Anima, a high-quality anime-style image generation model developed by CircleStone Labs. It builds upon Cosmos 2, a model developed by NVIDIA’s research team.

II. Model Components & Training Details

  • Text Encoder: Pre-trained Qwen-3-0.6b
  • Variational Autoencoder: Pre-trained Qwen Image VAE
  • Image Backbone: Fine-tuned Anima Image Backbone

III. Suggestion

Recommended Settings

CFG: 4–7

Sampling Steps: 25-40

Sampler: Euler a (with scheduler: normal)

Author: DaNS2025

Likes: 2

Downloads: 0

Tags: gguf, base_model:duongve/AnimaYume, base_model:quantized:duongve/AnimaYume, license:fair-noncommercial-research-license, region:us