Today's AI Summary

AI Developments: New Models and Research Emerge

Here's a look at the latest AI models and research papers, highlighting key advancements and their potential impact.

Research Papers

The research landscape is buzzing with activity, with several notable papers published recently:

  • Stitch: Training-Free Position Control in Multimodal Diffusion Transformers: This paper introduces a novel method called "Stitch" for improving spatial accuracy in text-to-image generation. By using automatically generated bounding boxes, Stitch allows for precise object placement without requiring additional training. The method shows significant improvements on benchmarks like PosEval, demonstrating its potential to enhance existing models like Qwen-Image and FLUX.
  • OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction: This research addresses the challenges of retargeting human motions to humanoid robots. OmniRetarget preserves human-object and human-environment interactions, generating kinematically feasible trajectories for training reinforcement learning policies. This leads to more realistic and effective robot locomotion and manipulation skills.
  • Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees: This paper introduces "measurement trees," a new type of metric designed to provide a more interpretable and multi-level representation of AI system performance. Measurement trees allow for the integration of diverse evidence, such as agentic, business, and security signals, offering a more comprehensive approach to AI evaluation.
  • Learning Generalizable Shape Completion with SIM(3) Equivariance: This paper tackles the problem of 3D shape completion, proposing a SIM(3)-equivariant network that is robust to variations in pose and scale. The model outperforms existing methods on benchmarks like PCN and demonstrates improved generalization across different datasets.
  • Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training: This research delves into the visual priors that Large Language Models (LLMs) develop during language pre-training. The study reveals that these priors consist of separable perception and reasoning components, with distinct origins and scaling trends. The findings offer insights into how to cultivate visual priors in LLMs, paving the way for advanced multimodal models.
  • MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages: This paper introduces MENLO, a framework for evaluating the native-like quality of LLM responses across multiple languages. The framework includes a dataset of human-annotated preference pairs and demonstrates improvements through fine-tuning with reinforcement learning.
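Of the themes above, SIM(3) equivariance is the easiest to illustrate concretely: a similarity transform applies rotation, uniform scale, and translation, and an equivariant function commutes with it. The sketch below is a generic 2D illustration using the centroid (which is trivially equivariant under similarity transforms), not the paper's network:

```python
import math

def apply_sim(points, s, theta, t):
    """Apply a 2D similarity transform (scale s, rotation theta, translation t).
    SIM(3) is the 3D analogue; 2D keeps the example short."""
    c, si = math.cos(theta), math.sin(theta)
    return [(s * (c * x - si * y) + t[0], s * (si * x + c * y) + t[1])
            for x, y in points]

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
s, theta, t = 2.0, math.pi / 2, (1.0, -1.0)

# Equivariance check: transforming then taking the centroid equals
# taking the centroid then transforming.
a = centroid(apply_sim(pts, s, theta, t))
b = apply_sim([centroid(pts)], s, theta, t)[0]
assert abs(a[0] - b[0]) < 1e-9 and abs(a[1] - b[1]) < 1e-9
```

An equivariant shape-completion network satisfies the same commutation property for its learned function, which is what makes it robust to pose and scale variation by construction.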

Models

Several new models have been released, focusing on efficiency and specific hardware configurations:

  • ubergarm/GLM-4.6-GGUF: This model provides ik_llama.cpp imatrix quantizations of zai-org/GLM-4.6. It requires the ik_llama.cpp fork to support the latest quants and optimizations. The quantizations offer best-in-class perplexity for the given memory footprint. The model card provides detailed instructions for setup and usage.
  • Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF: This model is specifically quantized for systems with 128GB of RAM and single GPUs, also requiring ik_llama.cpp. It offers different versions optimized for varying VRAM capacities, with detailed recipes and example commands for running the model.
  • mradermacher/Apriel-1.5-15b-Thinker-GGUF: This model provides static quants of ServiceNow-AI/Apriel-1.5-15b-Thinker in GGUF format. Various quantization levels are available, catering to different performance and memory requirements.
  • aquif-ai/aquif-AlphaMoE-7.5B-A3B: This is the first foundational model designed entirely by aquif AI, featuring a scalable Mixture of Experts (MoE) architecture. It achieves strong performance on benchmarks like MMLU, GPQA-D, and Math-500, demonstrating its capabilities in general knowledge, science, and code tasks.
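A rough way to see why these quantized releases target specific memory budgets: on-disk size scales with parameter count times effective bits per weight. The bits-per-weight figures below are approximate values for common GGUF quant types (illustrative only, ignoring per-block scale and metadata overhead), applied to a hypothetical 15B model:

```python
def gguf_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of a quantized model in GB, ignoring
    per-block scale overhead and file metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for some common quant types:
for name, bits in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{gguf_footprint_gb(15e9, bits):.1f} GB")
```

At roughly 4.8 bits per weight, a 15B model drops from ~30 GB in F16 to ~9 GB, which is what makes single-GPU or RAM-constrained setups viable.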

Key Takeaways

  • Spatial Accuracy in T2I: The "Stitch" paper highlights the ongoing efforts to improve spatial reasoning in text-to-image generation, offering a training-free solution that enhances existing models.
  • Humanoid Robot Control: The OmniRetarget research addresses the challenges of transferring human motion to robots, focusing on preserving interactions for more realistic and effective control.
  • Comprehensive AI Evaluation: The introduction of "measurement trees" signals a move towards more holistic and interpretable AI evaluation methods.
  • Efficient Model Quantization: The release of models like ubergarm/

AI Papers for 2026-03-17

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become physically compliant, they may exhibit substantial deviations from the original motion. To address this issue, we propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate the WBC into our training pipeline and optimize the diffusion model so that the WBC's output complies with both physics and the original text instructions. To train PhysMoDPO, we deploy physics-based and task-specific rewards and use them to assign preferences to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO yields significant improvements when applied to zero-shot motion transfer in simulation and to real-world deployment on a G1 humanoid robot.
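PhysMoDPO builds on Direct Preference Optimization. For reference, the standard DPO objective for a single preference pair looks like the sketch below (vanilla DPO only, not the paper's WBC-in-the-loop pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    logp_* are sequence log-probs under the policy being trained,
    ref_logp_* under the frozen reference model; w = preferred
    trajectory, l = dispreferred trajectory."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy ranks the preferred trajectory higher
# than the reference does, relative to the dispreferred one:
assert dpo_loss(-5.0, -9.0, -6.0, -6.0) < dpo_loss(-6.0, -6.0, -6.0, -6.0)
```

In PhysMoDPO the preference labels come from physics-based and task-specific rewards evaluated on WBC outputs, rather than from human annotators.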

Visual-ERM: Reward Modeling for Visual Equivalence

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge -- learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform closing this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and in dedicated reflection sessions correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and reduces deviation from literature values from 47% to 3% -- and when transferred to an unfamiliar material, achieves 1% deviation with zero pipeline failures.

LLM Constitutional Multi-Agent Governance

Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.
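If the Ethical Cooperation Score is read as a plain product of its four factors (an assumption for illustration; the paper's exact composite may weight or transform them differently), the penalty structure is easy to see: any single eroded dimension caps the overall score.

```python
def ecs(cooperation, autonomy, integrity, fairness):
    """A multiplicative composite: any single low factor drags the
    score down, so cooperation gained by eroding autonomy is
    penalized. Sketch only; the paper's exact formula is not
    reproduced here."""
    return cooperation * autonomy * integrity * fairness

# High raw cooperation with degraded autonomy scores below moderate
# cooperation with autonomy preserved (illustrative numbers):
assert ecs(0.87, 0.60, 0.99, 0.89) < ecs(0.77, 0.98, 0.99, 0.98)
```

This is the qualitative effect the paper reports: the unconstrained regime wins on raw cooperation but loses on ECS once autonomy erosion and fairness degradation are multiplied in.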

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessary utility loss or even more serious misalignment in predictions between training data and non-training data. In this work, we observed three insights: i) privacy vulnerability exists in a very small fraction of weights; ii) however, most of those weights also critically impact utility performance; iii) the importance of weights stems from their locations rather than their values. According to these insights, to preserve privacy, we score critical weights, and instead of discarding those neurons, we rewind only the weights for fine-tuning. We show that, through extensive experiments, this mechanism exhibits outperforming resilience in most cases against Membership Inference Attacks while maintaining utility.
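The score-then-rewind mechanism described above can be sketched generically as follows; the scoring rule here is a hypothetical stand-in (the paper's actual criterion is not reproduced), and only the top-scoring fraction of weights is touched:

```python
import random

def rewind_critical(current, initial, scores, fraction=0.01):
    """Rewind the top-scoring fraction of weights to their earlier
    values, leaving all other weights intact. Rewinding (rather than
    zeroing) preserves the utility carried by those weight locations."""
    k = max(1, int(len(current) * fraction))
    critical = sorted(range(len(current)), key=lambda i: -scores[i])[:k]
    out = list(current)
    for i in critical:
        out[i] = initial[i]  # rewind, don't discard
    return out

random.seed(0)
w_now = [random.gauss(0, 1) for _ in range(1000)]   # trained weights
w_init = [random.gauss(0, 1) for _ in range(1000)]  # earlier checkpoint
s = [abs(a - b) for a, b in zip(w_now, w_init)]     # hypothetical score
w_new = rewind_critical(w_now, w_init, s, fraction=0.01)
changed = sum(a != b for a, b in zip(w_now, w_new))
print(changed)  # only 1% of weights are modified
```

Fine-tuning would then resume from `w_new`, re-learning utility at the rewound locations without re-memorizing the training members.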

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the RMS using only the block scales calculated as part of the MXFP8 cast and enables a 32x decrease in the size of reduction needed for normalization. We validate our approximation method on pre-training of Llama 3 models of 125M, 1B and 8B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. We also show practical kernel speedups using only torch.compile of up to 2.4x for MXNorm over RMSNorm, corresponding to a 1.3% speedup in Llama 3 8B transformer layers in MXFP8 and a 2.6% speedup in NVFP4.
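One plausible reading of the idea, sketched below: MXFP8 casts share one power-of-two scale per 32-element block, so the block scales form a 32x-smaller proxy from which the RMS can be estimated. This is a sketch under that assumption; the paper's exact estimator and calibration may differ.

```python
import math
import random

BLOCK = 32  # MX block size

def block_scales(x):
    """Power-of-two per-block scales as computed for an MX-style cast:
    scale_b = 2^floor(log2(amax_b)) for each block of 32 values."""
    scales = []
    for i in range(0, len(x), BLOCK):
        amax = max(max(abs(v) for v in x[i:i + BLOCK]), 1e-38)
        scales.append(2.0 ** math.floor(math.log2(amax)))
    return scales

def rms_true(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_from_scales(scales, calib=1.0):
    """Estimate the RMS from the block scales alone -- a reduction
    over 32x fewer elements. `calib` is a distribution-dependent
    correction factor (hypothetical; see note below)."""
    return calib * math.sqrt(sum(s * s for s in scales) / len(scales))

random.seed(1)
x = [random.gauss(0, 1) for _ in range(4096)]
print(f"true={rms_true(x):.3f}  est~{rms_from_scales(block_scales(x)):.3f}")
```

Because the scales track block maxima rounded down to powers of two rather than mean squares, the raw estimate is biased and a correction factor is needed to align it with the true RMS.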

Clustering Astronomical Orbital Synthetic Data Using Advanced Feature Extraction and Dimensionality Reduction Techniques

The dynamics of Saturn's satellite system offer a rich framework for studying orbital stability and resonance interactions. Traditional methods for analysing such systems, including Fourier analysis and stability metrics, struggle with the scale and complexity of modern datasets. This study introduces a machine learning-based pipeline for clustering approximately 22,300 simulated satellite orbits, addressing these challenges with advanced feature extraction and dimensionality reduction techniques. The key to this approach is using MiniRocket, which efficiently transforms 400 timesteps into a 9,996-dimensional feature space, capturing intricate temporal patterns. Additional automated feature extraction and dimensionality reduction techniques refine the data, enabling robust clustering analysis. This pipeline reveals stability regions, resonance structures, and other key behaviours in Saturn's satellite system, providing new insights into their long-term dynamical evolution. By integrating computational tools with traditional celestial mechanics techniques, this study offers a scalable and interpretable methodology for analysing large-scale orbital datasets and advancing the exploration of planetary dynamics.
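The clustering stage at the end of such a pipeline can be sketched with plain k-means over extracted feature vectors; MiniRocket feature extraction and the dimensionality reduction steps are upstream and omitted here, with small synthetic blobs standing in for the reduced orbit features:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on feature vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # Recompute each center as its group's mean (keep old center if empty).
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

rng = random.Random(42)
def blob(cx, cy):
    """Synthetic stand-in for reduced orbit features around one regime."""
    return [[cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)] for _ in range(50)]

centers, groups = kmeans(blob(0, 0) + blob(5, 5), k=2)
print([len(g) for g in groups])
```

In the study's setting the cluster structure, rather than the algorithm, carries the physics: stability regions and resonances appear as well-separated groups in the reduced feature space.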

Semantic Invariance in Agentic AI

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
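The headline metric, the fraction of invariant responses, reduces to an agreement rate against the canonical formulation. The transformation names below mirror the paper's list, while the answers themselves are hypothetical:

```python
def invariance_rate(answers_by_transform):
    """Fraction of semantic-preserving transformations whose answer
    matches the identity (canonical) formulation's answer."""
    base = answers_by_transform["identity"]
    others = [k for k in answers_by_transform if k != "identity"]
    return sum(answers_by_transform[k] == base for k in others) / len(others)

# Hypothetical answers from one model on one problem under the
# eight transformations:
answers = {
    "identity": "42", "paraphrase": "42", "reorder": "42",
    "expansion": "42", "contraction": "41", "academic": "42",
    "business": "42", "contrastive": "42",
}
print(invariance_rate(answers))  # 6 of 7 non-identity variants agree
```

Aggregating this rate over problems and models yields the stability percentages reported in the paper; semantic similarity scoring would replace exact string match for free-form answers.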

Developing and evaluating a chatbot to support maternal health care

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
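The reported trade-off between emergency recall and over-escalation comes straight from the triage confusion matrix. Below is a sketch with a hypothetical label/prediction split; the counts are illustrative, chosen only to reproduce an 86.7% recall:

```python
def triage_metrics(labels, preds):
    """labels/preds: 1 = emergency, 0 = routine.
    Returns (emergency recall, over-escalation rate): missed
    emergencies are the costly error, while over-escalation measures
    routine queries routed to emergency care."""
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical split: 30 emergencies (26 caught), 120 routine (18 escalated).
labels = [1] * 30 + [0] * 120
preds = [1] * 26 + [0] * 4 + [1] * 18 + [0] * 102
recall, over = triage_metrics(labels, preds)
print(f"recall={recall:.3f}  over-escalation={over:.3f}")
```

Reporting both numbers together, as the paper does, makes the safety trade-off explicit: recall can always be pushed up by escalating more, at the cost of flooding emergency pathways.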

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

AI Models

tencent/Covo-Audio-Chat


language:
  • en
  • zh

pipeline_tag: audio-to-audio

Covo-Audio

<div align="center"> <h1> Covo-Audio Technical Report </h1>

arXiv GitHub HuggingFace

</div>

πŸ“– Overview

Covo-Audio is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture, which is presented in the paper Covo-Audio Technical Report. We release Covo-Audio-Chat in this repository.

<div align="center"> <figure> <img src="assets/covoaudio-results-overview.png" alt="Covo-Audio-Chat Results" width="75%"> <br> <figcaption><em>An Overview of Comprehensive Performance Comparison.</em></figcaption> </figure> </div>

Key Features

  • Hierarchical Tri-modal Speech-Text Interleaving: We propose a framework designed to achieve deep alignment and fusion across modalities and scales. The Tri-modal aspect integrates continuous acoustic features, discrete speech tokens, and natural language text within a unified sequence, effectively bridging the gap between high-fidelity prosodic nuances and robust semantic structures.

  • Mitigating Intelligence-Speaker Coupling: We propose an intelligence-speaker decoupling technique that separates speaker identity from dialogue intelligence via multi-speaker training, then develop a contextual adaptation method to transfer and share high-quality TTS voices.

  • Native Full-Duplex Voice Interaction: We evolve Covo-Audio into Covo-Audio-Chat-FD, a variant with native, low-latency full-duplex capability.

  • Comprehensive State-of-the-Art Performance: Achieving state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction.

πŸ”§ Installation

1. Requirements

Python >= 3.11 is recommended.

conda create -n covoaudio python=3.11
conda activate covoaudio
pip install -r requirements.txt

2. Clone Repository

git clone https://github.com/Tencent/Covo-Audio.git
cd Covo-Audio

3. Download Pretrained Models

Using HuggingFace:

pip install huggingface-hub
hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio

The command above downloads the model from Hugging Face into the directory of the same name in this repository. Alternatively, you can store the model in a directory of your choice by changing the --local-dir argument (in that case, update the model_dir and decode_load_path arguments in example.sh accordingly before running the inference script).

πŸš€ Usage

Run Inference Scripts

After completing the configuration and model download, you can perform one-click inference by running the script:

bash example.sh

To perform interaction with our model, just replace the paths in example.py with your own audio files.


πŸ™ Acknowledgments

Part of the code for this project is based on the following open-source projects:

The LLM backbone and audio encoder of Covo-Audio are initialized, respectively, with weights from:


πŸ”— Citation

If you find this model useful, please cite our paper:

@misc{wang2026covoaudiotechnicalreport,
      title={Covo-Audio Technical Report}, 
      author={Wenfu Wang and Chenxing Li and Liqiang Zhang and Yiyang Zhao and Yuxiang Zou and Hanzhao Li and Mingyu Cui and Hao Zhang and Kun Wei and Le Xu and Zikang Huang and Jiajun Xu and Jiliang Hu and Xiang He and Zeyu Xie and Jiawen Kang and Youjun Chen and Meng Yu and Dong Yu and Rilin Chen and Linlin Di and Shulin Feng and Na Hu and Yang Liu and Bang Wang and Shan Yang},
      year={2026},
      eprint={2602.09823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.09823}, 
}

πŸ“„ License

Our model and code are licensed under LICENSE.

βœ‰οΈ Contact

If you have any questions or suggestions, feel free to contact us:


πŸ“” Disclaimer

Covo-Audio-Chat is for research and experimental purposes only. It may occasionally produce inaccurate, inappropriate, biased, outdated, or factually incorrect content. Users should independently verify critical information, and are solely responsible for their use of the model and any consequences thereof.

Author: tencent

Likes: 12

Downloads: 0

Tags: safetensors, covo_audio, audio-to-audio, custom_code, en, zh, arxiv:2602.09823, region:us

unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF


license: other
license_name: nvidia-nemotron-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
pipeline_tag: text-generation
language:
  • en
tags:
  • nvidia
  • pytorch
base_model: nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
track_downloads: true

GGUFs are still converting; please wait until they are live.

Read our How to Run Nemotron 3 Nano Guide!

<div> <p style="margin-top: 0;margin-bottom: 0;"> <em>See <a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0 GGUFs</a> for our quantization benchmarks.</em> </p> <div style="display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/models/nemotron-3"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> </div>
  • Note <think> and </think> are separate tokens, so use --special if needed.
  • You can also fine-tune the model with Unsloth.

<div align="center" style="line-height: 1;"> <a href="https://huggingface.co/collections/nvidia/nemotron-pre-training-datasets" target="_blank" style="margin: 2px;"> <img alt="Pre-Training Datasets" src="https://img.shields.io/badge/πŸ—„οΈ_Pre--Training_Datasets-Available_Here-76B900?logoColor=white" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://huggingface.co/collections/nvidia/nemotron-post-training-v3" target="_blank" style="margin: 2px;"> <img alt="Post-Training Datasets" src="https://img.shields.io/badge/πŸ—„οΈ_Post--Training_Datasets-Available_Here-76B900?logoColor=white" style="display: inline-block; vertical-align: middle;"/> </a> </div> <div align="center" style="line-height: 1;"> <a href="https://developer.nvidia.com/nemotron" target="_blank" style="margin: 2px;"> <img alt="Homepage" src="https://img.shields.io/badge/🏠Nemotron Developer Page-Learn More Here!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/> </a> <a href="https://discord.gg/9xpKQtVvrk" target="_blank" style="margin: 2px;"> <img alt="Discord" src="https://img.shields.io/badge/Discord-NVIDIA%20AI%20Developer-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/> </a> </div> <div align="center" style="line-height: 1;"> <a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/" style="margin: 2px;"> <img alt="License" src="https://img.shields.io/badge/License-NVIDIA Nemotron Open Model License-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/> </a> </div>

NVIDIA-Nemotron-3-Nano-4B-BF16

Model Developer: NVIDIA Corporation

Model Dates:

Dec 2025 - Jan 2026

Data Freshness:

September 2024

The pretraining data has a cutoff date of September 2024.

Model Overview

NVIDIA-Nemotron-3-Nano-4B-BF16 is a small language model (SLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.

The model has been compressed from NVIDIA-Nemotron-Nano-9B-v2 using the Nemotron Elastic framework. The details of the parent model NVIDIA-Nemotron-Nano-9B-v2 can be found in (Nemotron-H tech report). The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers.

The supported languages include: English. Improved using Qwen.

This model is ready for commercial use.

License/Terms of Use

Governing Terms: Use of this model is governed by the NVIDIA Nemotron Open Model License.

Deployment Geography: Global

Use Case

NVIDIA-Nemotron-3-Nano-4B is an edge-ready small language model intended for Agentic AI on edge platforms (Jetson Thor, GeForce RTX, DGX Spark). It targets key uses including AI gaming NPCs (teammates / companions), local voice assistants (for devices, apps, and games), and IoT automation. It is to be used in English and coding languages.

Release Date: 3/16/2026

Huggingface 3/16/2026 via https://huggingface.co/

References

Model Architecture

  • Architecture Type: Mamba2-Transformer Hybrid
  • Network Architecture: Nemotron-Hybrid

Input

  • Input Type(s): Text
  • Input Format(s): String
  • Input Parameters: One-Dimensional (1D): Sequences
  • Other Properties Related to Input: Context length up to 262K. Supported languages include English.

Output

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: One-Dimensional (1D): Sequences
  • Other properties Related to Output: Sequences up to 262K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

  • Runtime Engine(s): NeMo 25.07
  • Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100, GeForce RTX
  • Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Use it with Transformers

The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-4B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
messages = [
    {"role": "system", "content": "<system_prompt>"},  # replace <system_prompt> with your system prompt string
    {"role": "user", "content": "Write a haiku about GPUs"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=32,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

temperature=1.0 and top_p=0.95 are recommended for reasoning tasks, while temperature=0.6 and top_p=0.95 are recommended for tool calling.

If you'd like to turn reasoning off, add enable_thinking=False to apply_chat_template(). By default, enable_thinking is set to True.

messages = [
    {"role": "system", "content": "<system_prompt>"},  # replace <system_prompt> with your system prompt string
    {"role": "user", "content": "Write a haiku about GPUs"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=32,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Use it with vLLM

We need vllm>=0.15.1 for this model. If you are on Jetson Thor or DGX Spark, please use this vllm container.

pip install -U "vllm>=0.15.1"

Download the custom parser from the Hugging Face repository.

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/resolve/main/nano_v3_reasoning_parser.py

Launch a vLLM server using the custom parser.

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --served-model-name nemotron3-nano-4B-BF16 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Access the hosted API using a python client.


from openai import OpenAI

# NOTE: Streaming is preferred for better performance and resource efficiency.
# It allows you to start processing responses as they arrive, reducing latency.

# Synchronous example (non-streaming)
client = OpenAI(
    api_key="your-nvapikey",
    base_url="base-url"
)

response = client.chat.completions.create(
    model="nemotron3-nano-4B-BF16",
    messages=[
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    temperature=0.7,
    max_tokens=256,
    top_p=0.7,
    stream=False
)

print(response.choices[0].message.content)

Use it with TRT-LLM

Launch the model using TRT-LLM

docker run -v /home/root/.cache/huggingface/:/root/.cache/huggingface/ --rm --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --ipc=host --network host -d -e MODEL=NVIDIA-Nemotron-3-Nano-4B-BF16 -e HF_TOKEN=$HF_TOKEN nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6 bash -c '
cat > /tmp/extra-llm-api-config.yml <<EOF
kv_cache_config:
  dtype: "auto"
  enable_block_reuse: false
cuda_graph_config:
  max_batch_size: 32
  enable_padding: true
disable_overlap_scheduler: true
moe_config: 
  backend: CUTLASS
EOF

trtllm-serve  \
NVIDIA-Nemotron-3-Nano-4B-BF16 \
--host 0.0.0.0 \
--port 8123 \
--max_batch_size 32 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml '

Access the hosted endpoint using curl command.

curl http://localhost:8123/v1/chat/completions -H "Content-Type: application/json"  -d '{
    "model": "NVIDIA-Nemotron-3-Nano-4B-BF16",
    "messages": [
        {
            "role": "user",
            "content": "Where is New York?"
        }
    ],
    "max_tokens": 1024,
    "top_p": 1.0
}' -w "\n"
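The same request can be made from Python using only the standard library; this mirrors the curl payload above (a sketch: it assumes the trtllm-serve endpoint from this section is running on localhost:8123, with the actual network call left commented out).

```python
import json
import urllib.request

# Build the same chat-completions payload as the curl example above.
payload = {
    "model": "NVIDIA-Nemotron-3-Nano-4B-BF16",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 1024,
    "top_p": 1.0,
}

req = urllib.request.Request(
    "http://localhost:8123/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server from this section running, send the request:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```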

Model Version

  • v1.0

Training, Testing, and Evaluation Datasets

Training datasets

  • Data Modality: Text
  • Text Training Data Size: More than 10 Trillion Tokens
  • Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
  • Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-3-Nano-4B consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.

More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

Public Datasets

| Dataset | Collection Period |
| :---- | :---- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |
| MultiverseMathHard | 10/2/2025 |
| SWE-Gym | 10/2/2025 |
| WorkBench | 10/2/2025 |
| WildChat-1M | 10/2/2025 |
| OpenCodeReasoning-2 | 10/2/2025 |
| HelpSteer3 | 10/2/2025 |
| opc-sft-stage2 | 10/2/2025 |
| Big-Math-RL-Verified | 10/2/2025 |
| NuminaMath CoT | 10/2/2025 |
| MetaMathQA | 10/2/2025 |
| simple-arithmetic-problems | 10/2/2025 |
| arithmetic | 10/2/2025 |
| Skywork-OR1-RL-Data | 10/2/2025 |
| News Commentary | 10/2/2025 |
| FastChat | 10/2/2025 |
| Essential-Web | 10/2/2025 |
| finepdfs | 10/2/2025 |
| HotpotQA | 10/2/2025 |
| SQuAD2.0 | 10/2/2025 |
| NLTK Words Lists | 10/2/2025 |

Private Non-publicly Accessible Datasets of Third Parties

| Dataset |
| :---- |
| Global Regulation |
| Workbench |

Online Dataset Sources

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper.

Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, and CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied only heuristic filtering instead, similar to what we did for lower-quality English data in the Nemotron-CC pipeline, while selectively removing filters that did not work well for some languages. Deduplication was done in the same way as for Nemotron-CC.

The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the technical report).

| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| :---- | :---- | :---- | :---- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |
| English Common Crawl 1.1 | Text | Not disclosed | 10/2/2025 |


Evaluation Dataset:

  • Data Collection Method by dataset: Hybrid: Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Evaluation Results:

Benchmark Results (Reasoning On)

We evaluated our model in **Reasoning-On** mode across these benchmarks.

| Benchmark | NVIDIA-Nemotron-3-Nano-4B-BF16 |
| :---- | :---: |
| AIME25 | 78.5 |
| MATH500 | 95.4 |
| GPQA | 53.2 |
| LCB | 51.8 |
| BFCL v3 | 61.1 |
| IFEVAL-Prompt | 87.9 |
| IFEVAL-Instruction | 92 |
| Tau2-Airline | 33.3 |
| Tau2-Retail | 39.8 |
| Tau2-Telecom | 33 |

We also evaluated our model in **Reasoning-Off** mode across these benchmarks.

| Benchmark | NVIDIA-Nemotron-3-Nano-4B-BF16 |
| :---- | ----- |
| BFCL v3 | 61.1 |
| IFBench-Prompt | 43.2 |
| IFBench-Instruction | 44.2 |
| Orak | 22.9 |
| IFEval-Prompt | 82.8 |
| IFEval-Instruction | 88 |
| HaluEval | 62.2 |
| RULER (128k) | 91.1 |
| Tau2-Airline | 28.0 |
| Tau2-Retail | 34.8 |
| Tau2-Telecom | 24.9 |
| EQ-Bench3 | 63.2 |

All evaluations were done using NeMo-Skills and Orak. For Orak, we evaluated on three games (Super Mario, Darkest Dungeon, and Stardew Valley).

Inference

  • Engines: HF, vLLM, llama-cpp, TRT-LLM, SGLang
  • Test Hardware: NVIDIA GeForce RTX, H100 80GB, DGX Spark, Jetson Thor/Orin Nano

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

We advise against circumventing any safety guardrails provided with the Model without a substantially similar guardrail appropriate for your use case. For more details, see the Safety and Explainability Subcards.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Author: unsloth

Likes: 9

Downloads: 0

Tags: gguf, nvidia, pytorch, text-generation, en, arxiv:2511.16664, arxiv:2504.03624, arxiv:2512.20856, arxiv:2512.20848, arxiv:2412.02595, base_model:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16, base_model:quantized:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16, license:other, endpoints_compatible, region:us, conversational

Lightricks/LTX-2.3-nvfp4


language:

  • en
  • de
  • es
  • fr
  • ja
  • ko
  • zh
  • it
  • pt

library_name: diffusers
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: image-to-video
arxiv: 2601.03233

tags:
  • image-to-video
  • text-to-video
  • video-to-video
  • image-text-to-video
  • audio-to-video
  • text-to-audio
  • video-to-audio
  • audio-to-audio
  • text-to-audio-video
  • image-to-audio-video
  • image-text-to-audio-video
  • ltx-2
  • ltx-2-3
  • ltx-video
  • ltxv
  • lightricks

pinned: true
demo: https://console.ltx.video/playground/

LTX-2.3 NVFP4 Model Card

This is the NVFP4 version of the LTX-2.3 model. All information below is derived from the base model.

This model card focuses on the LTX-2.3 model, which is a significant update to the LTX-2 model with improved audio and visual quality as well as enhanced prompt adherence. LTX-2 was presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model.

💻💻 If you want to dive right into the code, it is available here. 💾💾

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

LTX-2.3 Open Source

Model Checkpoints

| Name | Notes |
| :---- | :---- |
| ltx-2.3-22b-dev-nvfp4 | The full model, flexible and trainable, in NVFP4, trained with Quantization-Aware Distillation for improved accuracy |
| ltx-2.3-22b-distilled-nvfp4 (coming soon) | The distilled version of the full model, 8 steps, CFG=1, in NVFP4 |

Model Details

  • Developed by: Lightricks
  • Model type: Diffusion-based audio-video foundation model
  • Language(s): English

Online demo

LTX-2.3 is accessible right away via the API Playground.

Run locally

Direct use license

You can use the models (full, distilled, upscalers, and any derivatives of the models) for purposes permitted under the license.

ComfyUI

We recommend you use the built-in LTXVideo nodes that can be found in the ComfyUI Manager. For manual installation information, please refer to our documentation site.

PyTorch codebase

The LTX-2 codebase is a monorepo with several packages, from model definition in ltx-core to pipelines in ltx-pipelines and training capabilities in ltx-trainer. The codebase was tested with Python >= 3.12 and CUDA > 12.7, and supports PyTorch ~= 2.7.

Installation

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate

Inference

To use our model, please follow the instructions in our ltx-pipelines package.

Diffusers 🧨

LTX-2.3 support in the Diffusers Python library is coming soon!

General tips:

  • Width and height must be divisible by 32, and the frame count must be a multiple of 8 plus 1 (i.e., of the form 8k + 1).
  • If the resolution or frame count does not satisfy these constraints, pad the input with -1 and then crop to the desired resolution and number of frames.
  • For tips on writing effective prompts, please visit our Prompting guide
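The divisibility rules above can be checked mechanically; the helpers below are a minimal sketch with names of our own choosing, not part of the LTX-2 API.

```python
def is_valid(width: int, height: int, frames: int) -> bool:
    # Width and height must be divisible by 32; frame count must be 8k + 1.
    return width % 32 == 0 and height % 32 == 0 and frames % 8 == 1

def round_up(value: int, multiple: int) -> int:
    # Smallest multiple of `multiple` that is >= value.
    return ((value + multiple - 1) // multiple) * multiple

def pad_to_valid(width: int, height: int, frames: int) -> tuple:
    # Round each dimension up to the nearest valid value (pad, then crop later).
    return round_up(width, 32), round_up(height, 32), round_up(frames - 1, 8) + 1
```

For example, a 1000x600 request with 100 frames would be padded to 1024x608 with 105 frames.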

Limitations

  • This model is not intended or able to provide factual information.
  • As a statistical model this checkpoint might amplify existing societal biases.
  • The model may fail to generate videos that match the prompt perfectly.
  • Prompt following is heavily influenced by the prompting-style.
  • The model may generate content that is inappropriate or offensive.
  • When generating audio without speech, the audio may be of lower quality.

Train the model

Currently it is recommended to train the bf16 model. Recipes for training the fp8 model are welcome as community contributions.

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Guy Shiran and Itay Chachy and Jonathan Chetboun and Michael Finkelson and Michael Kupchick and Nir Zabari and Nitzan Guetta and Noa Kotler and Ofir Bibi and Ori Gordon and Poriya Panet and Roi Benita and Shahar Armon and Victor Kulikov and Yaron Inger and Yonatan Shiftan and Zeev Melumian and Zeev Farbman},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}

Author: Lightricks

Likes: 8

Downloads: 0

Tags: diffusers, image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks, en, de, es, fr, ja, ko, zh, it, pt, arxiv:2601.03233, license:other, region:us

Aisha-AI-Official/wan2.2-bbc-handjob-wideview


pipeline_tag: image-text-to-video
tags:

  • lora
  • nsfw

BBC Handjob (Wide view) [Real]

Stroke it honey 😏

Yes sir 😇

(Download links at the end of this page)

<video width="300px" loop autoplay muted src="https://huggingface.co/Aisha-AI-Official/wan2.2-bbc-handjob-wideview/resolve/main/bbc_handjob_wideview_cut_1.mp4"></video>

Scene start with a woman posing, she is the main focus of scenes.
The scene cut to her and a man in a bathroom with gray tiled walls, a shower pours water on him.
But now she is completely naked and standing next to him, her breasts are exposed with erect nipples, her body smeared with white soap, her vagina exposed. She is looking at his penis, she is holding his hard black penis.
He has black skin, is completely naked and standing, his body smeared with white soap, his penis is black and hard. He has his hands behind his back, looking at his penis.

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

<video width="300px" loop autoplay muted src="https://huggingface.co/Aisha-AI-Official/wan2.2-bbc-handjob-wideview/resolve/main/bbc_handjob_wideview_cut_2.mp4"></video>

Scene start with a woman posing, she is the main focus of scenes.
The scene cut to her and a man in a bathroom with gray tiled walls, a shower pours water on him.
But now she is completely naked and standing next to him, her breasts are exposed with erect nipples, her body smeared with white soap, her vagina exposed. She is looking at his penis, she is holding his hard black penis.
He has black skin, is completely naked and standing, his body smeared with white soap, his penis is black and hard. He has his hands behind his back, looking at his penis.

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

<video width="300px" loop autoplay muted src="https://huggingface.co/Aisha-AI-Official/wan2.2-bbc-handjob-wideview/resolve/main/bbc_handjob_wideview_cut_3.mp4"></video>

Scene start with a woman posing, she is the main focus of scenes.
The scene cut to her and a man in a bathroom with gray tiled walls, a shower pours water on him.
But now she is completely naked and standing next to him, her breasts are exposed with erect nipples, her body smeared with white soap, her vagina exposed. She is looking at his penis, she is holding his hard black penis.
He has black skin, is completely naked and standing, his body smeared with white soap, his penis is black and hard. He has his hands behind his back, looking at his penis.

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

Training

  • 200 steps on high noise
  • 2 edited videos (40 frames each video)
  • HLR + ZCD (These acronyms were created solely to confuse you)
  • Lots of confidence
  • $3 (βœ‹πŸ»πŸ€ πŸ€šπŸ»)

Usage (Low noise)

This LoRA was trained only on High Noise, which means you'll have to use some Low Noise that knows what a penis is (I used the Low Noise from DR34ML4Y v2).

I2V:

Prompt (ongoing):

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

Prompt (cut to action, standing):

Scene start with a woman posing, she is the main focus of scenes.
The scene cut to her and a man in a bathroom with gray tiled walls, a shower pours water on him.
But now she is completely naked and standing next to him, her breasts are exposed with erect nipples, her body smeared with white soap, her vagina exposed. She is looking at his penis, she is holding his hard black penis.
He has black skin, is completely naked and standing, his body smeared with white soap, his penis is black and hard. He has his hands behind his back, looking at his penis.

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

Prompt (cut to action, kneeling):

Scene start with a woman posing, she is the main focus of scenes.
The scene cut to her and a man in a bathroom with gray tiled walls, a shower pours water on him.
But now she is completely naked and kneeling next to him, her breasts are exposed with erect nipples, her body smeared with white soap, her vagina exposed. She is looking at his penis, she is holding his hard black penis.
He has black skin, is completely naked and standing, his body smeared with white soap, his penis is black and hard. He has his hands behind his back, looking at his penis.

She holds his hard black penis with a closed fist, moving her hand up and down rapidly, stroking his hard black penis.

This LoRA becomes very flexible: you can completely change the scenario and it will still work.

High Noise LoRA Scale: 1.0

Low Noise LoRA Scale: 1.0

Shift: 4

T2V:

Theoretically it works, but I haven't tested it. If you want to test it, keep the same "cut to action" structure as I2V, and start with a low scale on High Noise (0.5).

About HLR + ZCD

This is a fast-learning technique, which makes LoRA less flexible. This can drastically reduce creativity, but it yields stable results while using few resources. Negative effects:

  1. It will probably only work at the same camera angle
  2. High chance of not responding to different prompts
  3. High chance of forcing the original characters of the training video. Training only the High Noise reduces this chance, but it can still force features that are VERY similar, like the same hair length, same color, etc.
  4. ZCD = Zero Caption Dropout, meaning the training was conditioned on the caption. This causes the LoRA to be activated only when this exact (or a very similar) prompt is used, increasing compatibility with other LoRAs and allowing it to be incorporated into the checkpoint without trying to force the execution of what was learned.

Download

Download High Noise LoRA

Download Low Noise LoRA (DR34ML4Y_I2V_14B_LOW_V2)

Help me creating more

If you want to help me continue making LoRAs, or if you want me to make a LoRA for you, buy 5000 PlayCoins at Aisha-AI and transfer them to my account (account number 2).

This helps Aisha-AI to stay alive and produce new LoRAs for you all 💜

Author: Aisha-AI-Official

Likes: 6

Downloads: 0

Tags: lora, nsfw, image-text-to-video, region:us, not-for-all-audiences

RoyalCities/Foundation-1


language: en
tags:

  • audio
  • music-generation
  • sample-generation
  • Music Production
  • Audio-to-Audio
  • fine-tuning
  • stable-audio

datasets:
  • custom

model_name: Foundation-1
base_model: stabilityai/stable-audio-open-1.0
license: other
license_name: stabilityai-community-license
license_link: https://stability.ai/license

<center><img src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/Charts/banner.PNG" alt="Foundation-1 Banner" width="100%"></center> <center> <h1 style="font-size: 34px;"><u>Foundation-1</u></h1> </center> <center> <h3 style="font-size: 20px;">Structured text-to-sample generation for modern music production</h3> </center>
<h2 align="center">Overview</h2>

Foundation-1 is a next-generation text-to-sample model designed around musical structure. It was trained to understand instrumentation, timbre, FX, and notation as separate composable controls. This gives musicians and producers direct control over not just instrument identity, but also sonic character, phrase behavior, musical feel, and loop structure.

The result is a model built for actual production workflows: tempo-synced, key-aware, bar-aware sample generation with strong musicality, strong prompt adherence, and unusually high timbral flexibility.

Foundation-1 is designed for pure sample generation. It excels at generating coherent musical loops that stay locked to tempo and phrase length while allowing layered prompting across instrument families, timbre descriptors, FX, and notation-driven musical behavior.


<h2 align="center">What Foundation-1 Does</h2>
  • Generates musically coherent loops for production workflows
  • Understands BPM and bar count for structured loop generation
  • Locks to major and minor keys across western music theory
  • Supports enharmonic equivalents when prompting scales and keys
  • Separates instrument identity from timbral character
  • Supports timbral mixing by combining instrument and sonic descriptors
  • Responds to FX tags such as reverb, delay, distortion, and modulation
  • Uses notation-style prompt structure to encourage coherent phrasing, melodic shape, rhythmic behavior, and harmonic motion
  • Produces perfect loops within supported BPM / bar denominations
  • Understands Wet vs Dry production context: adding terms like Dry encourages minimal FX processing, while Wet or FX tags produce more processed, spatial, or effected sounds.

<h2 align="center">Why It Feels Different</h2>

Most audio models can react to broad prompt terms like "warm pad" or "bright synth," but with inconsistent results. Foundation-1 was designed to go further by treating the sound as a layered system:

  1. Instrument Family – what broad source category the sound belongs to
  2. Sub-Family – the more specific instrument role or identity
  3. Timbre Tags – the tonal, spectral, or textural character
  4. FX Tags – the processing layer applied to the sound
  5. Notation / Structure Tags – the musical behavior of the generated phrase

This layered conditioning approach is a major reason Foundation-1 is able to deliver both high musicality and high prompt control at the same time.


<h2 align="center">Audio Showcase</h2> <div style="text-align: center; margin: 20px 0;"> <table style="width: 100%; border-collapse: collapse; margin: 0 auto;"> <thead> <tr> <th style="border: 1px solid #000; padding: 8px; text-align: left;">Prompt</th> <th style="border: 1px solid #000; padding: 8px; text-align: center;">Audio</th> </tr> </thead> <tbody> <tr> <td style="border: 1px solid #000; padding: 8px;">Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Sub Bass, Bass, Upper Mids, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich, Overdriven, Crisp, Deep, Clean, Pitch Bend, 303, 8 Bars, 140 BPM, E minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_1.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Sub Bass, Bass, Gritty, Small, Square, Bass, Dark, Digital, Thick, Clean, Simple, Bassline, Epic, Choppy, Melody, 4 Bars, 150 BPM, G# minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_2.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Flute, Pizzicato, Punchy, Present, Ambient, Nasal, Melody, Epic, Airy, Slow Speed, 8 Bars, 150 BPM, E minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_3.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">High Saw, Spacey, Lead, Warm, Silky, Smooth, 303, Synth Lead, Medium Reverb, Low Distortion, Upper Mids, Mids, Pitch Bend, Arp, 8 Bars, 140 BPM, F minor</td> <td style="border: 1px solid 
#000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_4.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Trumpet, Warm, Complex Arp Melody, High Reverb, Low Distortion, Smooth, Silky, Texture, 8 Bars, 130 BPM, C minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_5.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Synth, Pad, Chord Progression, Rising, Digital, Bass, Fat, Near, Wide, Silky, Warm, Focused, 8 Bars, 110 BPM, D major</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_6.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Piccolo, Flute, Airy, Music Box, plucked, complex melody, 8 Bars, 140 BPM, C# minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_7.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Synth Lead, Wavetable Bass, Low Distortion, High Reverb, Sub Bass, Upper Mids, Acid, Gritty, Wide, Thick, Silky, Warm, Rich, Overdriven, Crisp, Clean, 303, Complex, 8 Bars, 140 BPM, F minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_8.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; 
padding: 8px;">Fiddle, Bowed Strings, Full, Clean, Spacey, Rich, Intimate, Thick, Rolling, Arp, Fast Speed, Complex, 8 Bars, 128 BPM, B minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_9.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Chiptune, Chord Progression, Pulse Wave, Medium Reverb, 8 Bars, 128 BPM, D minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_10.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="border: 1px solid #000; padding: 8px;">Kalimba, Mallet, Medium Reverb, Overdriven, Wide, Metallic, Thick, Sparkly, Upper Mids, Bright, Airy, Alternating, Chord Progression, Atmosphere, Spacey, Fast Speed, 8 Bars, 120 BPM, B minor</td> <td style="border: 1px solid #000; padding: 8px; text-align: center;"> <audio controls style="width: 260px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/example_11.mp3" type="audio/mpeg"> </audio> </td> </tr> </tbody> </table> </div>
<h2 align="center">Core Capabilities</h2>

1. Musical Structure

Foundation-1 was trained to produce structured musical material rather than full music or generic textures. Musical Notation terms can encourage notation, chord progressions, melodies, arps, phrase direction, rhythmic density, and other musically relevant behaviors.

2. Instrument Identity

The model supports a broad instrument hierarchy spanning synths, keys, basses, bowed strings, mallets, winds, guitars, brass, vocals, and plucked strings.

3. Timbral Control

Foundation-1 is not limited to broad instrument naming. It also responds to timbral descriptors such as spectral shape, tone, width, density, texture, brightness, warmth, grit, space, and other sonic traits.

4. Timbral Mixing

Because instrument identity and timbral character were not collapsed into a single flat label, the model is especially strong at timbral hybridization and layered sonic prompting.

5. FX Prompting

The model supports a dedicated FX layer covering multiple forms of reverb, delay, distortion, phaser, and bitcrushing.

6. Loop Fidelity

Foundation-1 is built for production-ready loop generation, including BPM-aware and bar-aware structure within supported denominations.
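As a worked example of tempo-synced loop lengths: assuming 4/4 time (an assumption on our part; the card does not state the meter), a bar lasts 4 x 60 / BPM seconds, so an 8-bar loop at 120 BPM is exactly 16 seconds.

```python
def loop_seconds(bars: int, bpm: float, beats_per_bar: int = 4) -> float:
    # Duration of a loop: bars * beats_per_bar beats, each lasting 60/bpm seconds.
    # beats_per_bar=4 assumes 4/4 time, which the card does not specify.
    return bars * beats_per_bar * 60.0 / bpm

# 8 bars at 120 BPM -> 16.0 seconds; 4 bars at 150 BPM -> 6.4 seconds
```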


<h2 align="center">Conditioning Architecture</h2>

Foundation-1 was trained with a layered tagging hierarchy designed to improve control, composability, and prompt clarity.

Hierarchy Overview

  • Major Family β†’ broad instrument class
  • Sub-Family β†’ more specific instrument role
  • Timbre Tags β†’ tonal / spectral / textural descriptors
  • FX Tags β†’ processing layer
  • Notation Tags β†’ musical behavior and phrasing

This makes it possible to prompt at different levels of abstraction. A user can stay broad with a family-level prompt like Synth or Keys, or get more specific with terms like Synth Lead, Wavetable Bass, Grand Piano, Violin, or Trumpet, then further shape the output using timbral and FX descriptors.


<h2 align="center">Instrument Coverage</h2>

Major Families

Foundation-1 was trained across the following major instrument families:

  • Synth
  • Keys
  • Bass
  • Bowed Strings
  • Mallet
  • Wind
  • Guitar
  • Brass
  • Vocal
  • Plucked Strings

Sub-Family Coverage

Foundation-1 includes a wide sub-family layer covering a broad range of production-relevant instrument roles, including but not limited to:

  • Synth Lead
  • Synth Bass
  • Digital Piano
  • Pluck
  • Grand Piano
  • Bell
  • Pad
  • Atmosphere
  • Digital Strings
  • FM Synth
  • Violin
  • Digital Organ
  • Supersaw
  • Wavetable Bass
  • Rhodes Piano
  • Cello
  • Texture
  • Flute
  • Reese Bass
  • Wavetable Synth
  • Electric Bass
  • Marimba
  • Trumpet
  • Pan Flute
  • Choir
  • Harp
  • Church Organ
  • Acoustic Guitar
  • Hammond Organ
  • Celesta
  • Vibraphone
  • Glockenspiel
  • Ocarina
  • Clarinet
  • French Horn
  • Tuba
  • Oboe
<center><img src="./Charts/subfamilites_pie.PNG" alt="Sub-Family Chart" width="80%"></center>
<h2 align="center">Timbre System</h2>

One of Foundation-1’s main strengths is that it was not trained to treat timbre as an afterthought. Timbral character is directly represented in the prompt system, giving users control over not only what is being generated, but also how it sounds.

Representative timbre descriptors include:

  • Warm
  • Bright
  • Wide
  • Airy
  • Thick
  • Rich
  • Tight
  • Full
  • Gritty
  • Clean
  • Retro
  • Saw
  • Crisp
  • Focused
  • Metallic
  • Chiptune
  • Dark
  • 303
  • Shiny
  • Analog
  • Present
  • Sparkly
  • Ambient
  • Soft
  • Smooth
  • Cold
  • Buzzy
  • Deep
  • Formant Vocal
  • Round
  • Punchy
  • Nasal
  • Vintage
  • Growl
  • Breathy
  • Glassy
  • Noisy
  • Synthetic Vox
  • Supersaw
  • Bitcrushed
  • Dreamy
<center><img src="./Charts/timbre_tags_pie.PNG" alt="Timbre Chart" width="80%"></center> <h2 align="center">Why This Matters</h2>

This tagging design makes prompts much more flexible. Instead of only asking for an instrument, users can shape:

  • tonal balance
  • brightness / darkness
  • width / intimacy
  • clean vs driven character
  • synthetic vs organic feel
  • transient sharpness
  • texture and density
  • spatial character

This is especially useful for producers who want to guide the output toward a specific role in a mix rather than just a generic instrument label.

For a full list of supported tags, see the Tag Reference Sheet.


<h2 align="center">FX Layer</h2>

Foundation-1 includes a dedicated FX descriptor layer spanning multiple common production effects.

Representative FX tags include:

  • Low Reverb
  • Medium Reverb
  • High Reverb
  • Plate Reverb
  • Low Delay
  • Medium Delay
  • High Delay
  • Ping Pong Delay
  • Stereo Delay
  • Cross Delay
  • Mono Delay
  • Low Distortion
  • Medium Distortion
  • High Distortion
  • Phaser
  • Low Phaser
  • Medium Phaser
  • High Phaser
  • Bitcrush
  • High Bitcrush
<center><img src="./Charts/fx_pie.PNG" alt="FX Chart" width="80%"></center>
<h2 align="center">Musical Notation and Structure</h2>

Foundation-1 was trained with structured musical descriptors designed to improve phrase coherence, rhythmic intent, melodic motion, and prompt control.

These notation-style prompt terms help steer:

  • chord progressions
  • melodies
  • top-line layers
  • arpeggios
  • phrase direction
  • rhythmic density
  • harmonic feel
  • subdivision style
  • simple vs complex motion
  • sustained vs plucked behavior
  • melodic contour and pacing

Examples of supported structural terms include:

  • chord progression
  • melody
  • top melody
  • arp
  • triplets
  • simple
  • complex
  • rising
  • falling
  • strummed
  • sustained
  • catchy
  • epic
  • slow
  • fast

This notation layer is one of the main reasons Foundation-1 produces unusually coherent musical material instead of static or loosely related phrases. These can be mixed and matched as desired.


<h2 align="center">Tonal and Timing Support</h2>

Foundation-1 is designed for structured music production workflows and supports:

Keys and Modes

  • Major keys
  • Minor keys
  • Enharmonic equivalents
  • Western 12-tone chromatic prompting

Loop Structure

  • Supported bar lengths: 4 Bars, 8 Bars
  • Supported BPM denominations: 100 BPM, 110 BPM, 120 BPM, 128 BPM, 130 BPM, 140 BPM, 150 BPM

<h2 align="center">Prompt Structure</h2>

For best results, use rich prompts built around the model’s tags. These tags can be mixed and matched as needed. The model was trained on a structured hierarchy designed to encourage musically coherent sample generation.

Layered Prompt Structure

[Instrument Family / Sub-Family], [Timbre], [Musical Behavior / Notation], [FX], [Key], [Bars], [BPM]

Prompting Notes

  • Start with a clear instrument identity
  • Add 1–3 timbre descriptors for stronger steering
  • Include a notation or musical structure term for better phrase coherence
  • Always include Bars and BPM, which define the musical loop length
  • Ensure the generation duration matches the requested musical structure
  • The RC Stable Audio Fork automatically handles this timing alignment

Use FX and timbre tags sparingly at first, then layer more once you understand the model’s behavior.
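As an illustration of the layered structure above, a prompt string can be assembled programmatically from the tag layers. This is a hypothetical helper, not part of any shipped tooling; the tag values come from the tables in this card, and the layer order follows the template shown above:

```python
def build_prompt(instrument, timbre=(), notation=(), fx=(),
                 key=None, bars=None, bpm=None):
    """Assemble a Foundation-1 style prompt from layered tag groups.

    Layer order follows the documented template:
    [Instrument], [Timbre], [Notation], [FX], [Key], [Bars], [BPM]
    """
    parts = [instrument, *timbre, *notation, *fx]
    if key is not None:
        parts.append(key)
    if bars is not None:
        parts.append(f"{bars} Bars")
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    # Tags are comma-separated, as in the example prompts.
    return ", ".join(parts)

prompt = build_prompt(
    "Synth Lead",
    timbre=("Warm", "Wide"),
    notation=("Chord Progression",),
    fx=("Medium Reverb",),
    key="B minor", bars=8, bpm=128,
)
print(prompt)
# Synth Lead, Warm, Wide, Chord Progression, Medium Reverb, B minor, 8 Bars, 128 BPM
```

Omitting optional layers simply produces a broader, family-level prompt, which matches the abstraction levels described under Conditioning Architecture.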


<h2 align="center">One Prompt β†’ Multiple Outputs</h2>

Each row below uses the exact same prompt, but a different random seed.
The timbre tags remain unchanged, so the overall sound character stays consistent while the melodic and musical content varies between generations.

<div align="center"> <table style="width:100%; border-collapse: collapse;"> <thead> <tr> <th style="padding:8px; text-align:left;">Prompt</th> <th style="padding:8px; text-align:center;">Output A</th> <th style="padding:8px; text-align:center;">Output B</th> <th style="padding:8px; text-align:center;">Output C</th> </tr> </thead> <tbody> <tr> <td style="padding:8px; text-align:left;"> <b>Bass, FM Bass, Medium Delay, Medium Reverb, Low Distortion, Phaser, Acid, Gritty, Wide, Dubstep, Thick, Silky, Warm, Rich, Overdriven, Crisp, Deep, Clean, Triplets, 8 Bars, 150 BPM, A minor</b> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_1_a.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_1_b.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_1_c.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td style="padding:8px; text-align:left;"> <b>Gritty, Acid, Bassline, 303, Synth Lead, FM, Sub, Upper Mids, High Phaser, High Reverb, Pitch Bend, 8 Bars, 140 BPM, E minor</b> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_2_a.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_2_b.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_2_c.mp3" type="audio/mpeg"> </audio> </td> </tr> <tr> <td 
style="padding:8px; text-align:left;"> <b>Kalimba, Mallet, Medium Reverb, Overdriven, Wide, Metallic, Thick, Sparkly, Upper Mids, Bright, Airy, Small, Alternating Chord Progression, Atmosphere, Spacey, Fast, 4 Bars, 120 BPM, B minor</b> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_3_a.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_3_b.mp3" type="audio/mpeg"> </audio> </td> <td align="center"> <audio controls style="width:160px;"> <source src="https://huggingface.co/RoyalCities/Foundation-1/resolve/main/examples/compare_example_3_c.mp3" type="audio/mpeg"> </audio> </td> </tr> </tbody> </table> </div>
<h2 align="center">Recommended Workflow</h2>

Foundation-1 is best used with the RC Stable Audio Fork, which is tuned around this model’s metadata and prompting structure.

It provides:

  • random prompt generation aligned with the training tags
  • automatic MIDI extraction from generated audio
  • automatic BPM / bar timing alignment for loop generation

Recommended Interfaces

RC Stable Audio Tools (Enhanced Fork)

Stable Audio Tools (Original Repository)

Model Files

In the folder you will find two files: the model checkpoint and its associated config file.

Unlike prior releases, which provided both 32-bit and 16-bit weights, this release includes only the 16-bit version. This reduces the model footprint with no loss in quality.

  • Foundation_1.safetensors
  • model_config.json

Basic Setup for the RC Enhanced Fork

  1. Create a subfolder inside your models directory
  2. Place the model checkpoint and config file inside that folder
  3. Launch the interface
  4. Select the model from the UI
  5. Prompt with layered musical descriptors for best results

Hardware Requirements

Foundation-1 is designed to run locally on modern GPUs.

Typical VRAM usage during generation is approximately 7 GB.
For reliable operation, a GPU with at least 8 GB of VRAM is recommended.

Generation Performance

Generation speed will vary depending on GPU model and system configuration.

On an RTX 3090, generation time is approximately 7–8 seconds per sample.


<h2 align="center">Dataset and Training Philosophy</h2>

Foundation-1 was built around a structured sample-generation philosophy, rather than generic or genre-based audio captioning. The dataset consists entirely of hand-crafted and labeled audio, produced through a controlled augmentation pipeline.

At a high level, the training design emphasizes:

  • structured musical loops
  • instrument hierarchy
  • explicit timbre representation
  • dedicated FX descriptors
  • notation-aware prompt terms
  • strong production relevance
  • broad reuse for compositional workflows

This design is central to the model’s musical coherence and high degree of sonic control.

For more details on the dataset and training methodology, see the Training & Dataset Notes.


<h2 align="center">Limitations</h2>

Foundation-1 is a specialized model for music sample generation, not a general-purpose music generator.

Important notes:

  • It performs best when prompted using vocabulary aligned with the training design
  • It is optimized for sample-generation workflows, not open-ended genre captioning
  • Only two genre tags were included (Dubstep Growls and Chiptune waveforms), primarily to reinforce waveform behaviors
  • Prompt quality matters β€” structured layered prompts outperform vague natural language
  • Some timbre tags exert stronger influence than others
  • Certain tag combinations may require iteration to achieve the exact musical role or timbral blend desired
  • Percussion and drum sounds are outside the scope of this release

The model is also optimized around specific timing relationships between Bars, BPM, and generation duration.

For example:

  • an 8-bar loop at 100 BPM β‰ˆ 19 seconds

If the generation duration is shorter than the musical structure implied by the prompt (for example requesting an 8-bar loop but generating only 5 seconds), the model may produce less coherent musical phrases.
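The timing relationship above is simple arithmetic; a quick check, assuming 4/4 time (4 beats per bar), reproduces the ≈19 s figure for an 8-bar loop at 100 BPM:

```python
def loop_duration_seconds(bars, bpm, beats_per_bar=4):
    """Loop length in seconds: total beats divided by beats per second."""
    return bars * beats_per_bar * 60.0 / bpm

# 8 bars at 100 BPM in 4/4 -> 19.2 seconds (the ~19 s cited above)
print(loop_duration_seconds(8, 100))  # 19.2
```

Setting the generation duration to at least this value avoids the truncated-phrase failure mode described above.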

The RC Stable Audio Fork automatically handles this timing alignment, making this workflow much easier.


<h2 align="center">License</h2>

This model is licensed under the Stability AI Community License. It is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M. For revenues exceeding USD $1M, please refer to the repository license file for full terms.


<h3 align="center">Companion Video</h3>

Further information on the model and design philosophy can be found in the companion video:

πŸŽ₯ Watch the Foundation-1 overview and design philosophy video


<h2 align="center">Final Notes</h2>

Foundation-1 is intended as a producer-facing foundation model for structured sample generation, designed to augment music production rather than replace it.

Its goal is to let users explore sound in new ways while retaining precise control over:

  • what the sound is
  • how it behaves musically
  • how it sits tonally
  • how it feels sonically
  • how it fits into a production workflow

That combination of musical structure, instrument identity, timbral control, and loop fidelity is what defines the model.

Author: RoyalCities

Likes: 6

Downloads: 0

Tags: audio, music-generation, sample-generation, Music Production, Audio-to-Audio, fine-tuning, stable-audio, en, dataset:custom, base_model:stabilityai/stable-audio-open-1.0, base_model:finetune:stabilityai/stable-audio-open-1.0, license:other, region:us

LuffyTheFox/Qwen3.5-27B-Claude-4.6-Opus-Uncensored-GGUF


language:

  • en
  • zh license: apache-2.0 base_model: Qwen/Qwen3.5-27B tags:
  • unsloth
  • qwen
  • qwen3.5
  • reasoning
  • chain-of-thought
  • Dense
  • uncensored
  • not-for-all-audiences pipeline_tag: text-generation datasets:
  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Jackrong/Qwen3.5-reasoning-700x

🌟 This is the Qwen3.5-27B-Claude-4.6-Opus-Uncensored-GGUF model with zero refusals, built with the HauhauCS method and combined with the Jackrong checkpoint.

Thinking is enabled by default in this model. You can disable it via this chat template in LM Studio.

I extracted the uncensored tensors made by HauhauCS via this script: https://pastebin.com/1qKgR3za and merged them with the Jackrong distilled checkpoint.

For best model performance, use the following settings in LM Studio:

Temperature: 0.7

Top K Sampling: 20

Presence Penalty: 1.5

Top P Sampling: 0.8

Min P Sampling: 0

Seed: 3407 or 42

And this system prompt: https://pastebin.com/pU25DVnB
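The settings above map onto an OpenAI-compatible request against a local LM Studio server. The sketch below only builds the request payload; the model id is an assumption, and whether `min_p` and `presence_penalty` are honored depends on the server build:

```python
import json

# Recommended sampler settings from above, as OpenAI-compatible request fields.
payload = {
    "model": "qwen3.5-27b-claude-4.6-opus-uncensored",  # assumed local model id
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.8,
    "min_p": 0,
    "presence_penalty": 1.5,
    "seed": 3407,  # or 42
    "messages": [
        {"role": "system", "content": "<system prompt from the pastebin link>"},
        {"role": "user", "content": "Hello!"},
    ],
}
print(json.dumps(payload, indent=2))
```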

πŸ“’ Release Note Build Environment Upgrades:

  • Fine-tuning Framework: Unsloth 2026.3.3
  • Core Dependencies: Transformers 5.2.0
  • This model fixes the crash in the official model caused by the Jinja template not supporting the "developer" role (commonly sent by modern coding agents such as Claude Code and OpenCode).
  • It does not disable thinking mode by default, allowing the agent to run continuously for over 9 minutes without interruption.
  • Compared to the original model, autonomy and stability are significantly improved.


πŸ’‘ Model Introduction

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a highly capable reasoning model fine-tuned on top of the powerful Qwen3.5 architecture. The model's core directive is to leverage state-of-the-art Chain-of-Thought (CoT) distillation primarily sourced from Claude-4.6 Opus interactions.

Through Supervised Fine-Tuning (SFT) focusing specifically on structured reasoning logic, this model excels in breaking down complex user problems, planning step-by-step methodologies within strictly formatted <think> tags, and ultimately delivering precise, nuanced solutions.

🧠 Example of Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern (“Let me analyze this request carefully: 1… 2… 3…”). This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
...

πŸ—ΊοΈ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 β”‚
 β–Ό
Supervised Fine-Tuning (SFT) + LoRA
 β”‚
 β–Ό
Final Model (Claude-4.6-Opus-Reasoning-Distilled, text-only)

πŸ“‹ Stage Details

πŸ”₯Community-tested advantages (benchmark tests by user @sudoingX on a single RTX 3090):

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled shows significant advantages in coding-agent environments such as Claude Code and OpenCode:

  • Native support for the β€œdeveloper” role, requiring no Jinja template patches or ChatML workarounds.
  • Thinking mode fully preserved (logs confirm thinking=1), not silently disabled, maintaining the complete chain-of-thought reasoning process.
  • Greatly improved autonomy and stability β€” capable of running continuously for over 9 minutes autonomously (with zero human intervention). It actively waits for tool responses, reads outputs, self-corrects errors, and can even automatically generate a README, whereas the base model often stalls or freezes mid-execution.

Hardware usage remains unchanged:

  • About 16.5 GB VRAM with Q4_K_M quantization
  • 29–35 tok/s generation speed
  • Full 262K context with no compromises

These improvements come from successfully distilling the structured reasoning style of Claude 4.6 Opus, allowing Qwopus to be truly plug-and-play in modern local coding agents and deliver an experience close to Opus in smoothness and usability.

Thanks to the community for the in-depth testing and feedback!

πŸ”Ή Supervised Fine-Tuning (SFT)

  • Objective: To inject high-density reasoning logic and establish a strict format for problem-solving involving an internal thinking state prior to outputting the final response.
  • Methodology: We utilized Unsloth for highly efficient memory and compute optimization. A critical component of this stage is the train_on_responses_only strategy, masking instructions so the loss is purely calculated over the generation of the <think> sequences and the subsequent solutions.
  • Format Enforcement: All training samples were systematically normalized so the model strictly abides by the structure <think> {internal reasoning} </think>\n {final answer}.
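The enforced layout can be checked mechanically. A minimal validator sketch (the exact whitespace tolerance in the regex is an assumption; the card only specifies `<think> {internal reasoning} </think>\n {final answer}`):

```python
import re

# Matches: <think> {internal reasoning} </think> newline {final answer}
PATTERN = re.compile(r"^<think>\n?(.*?)\n?</think>\n(.+)$", re.DOTALL)

def is_well_formed(sample: str) -> bool:
    """True if the sample follows the <think>...</think>\\n answer layout."""
    return PATTERN.match(sample) is not None

good = "<think>\nPlan the steps.\n</think>\nThe answer is 42."
bad = "The answer is 42."
print(is_well_formed(good), is_well_formed(bad))  # True False
```

A filter like this is the kind of normalization step the SFT stage describes: samples that fail the check would be reformatted or dropped before training.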

πŸ“š All Datasets Used

The dataset consists of high-quality, filtered reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Provides comprehensive Claude 4.6 Opus reasoning trajectories. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injects high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples designed to strengthen structured step-by-step problem solving and improve reasoning diversity. |

🌟 Core Skills & Capabilities

  1. Modular & Structured Thinking: Inheriting traits from Opus-level reasoning, the model confidently parses the prompt and lays out a sequential plan in its <think> block, rather than falling into exploratory "trial-and-error" self-doubt.

⚠️ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; facts asserted during the thinking sequence may occasionally be hallucinated, especially when verifying real-world events.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • Preview Version Notice: Because this model is relatively new and intentionally lightweight, the surrounding ecosystem β€” including inference templates, fine-tuning pipelines, routing configurations, and tooling integrations β€” may not yet be fully mature or standardized. As a result, users may encounter occasional bugs, compatibility inconsistencies, or integration edge cases. The current release should be considered a preview build while the broader architectural stack and supporting utilities continue to stabilize and improve.

πŸ™ Acknowledgements

Significant thanks to the Unsloth AI team for making rapid fine-tuning of MoE and large LLM models accessible. We also acknowledge the Qwen team and the open-source community developers producing exceptional distilled datasets (nohurry and TeichAI).

πŸ“– Citation

If you use this model in your research or projects, please cite:

@misc{jackrong_qwen35_opus_distilled,
  title        = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled}}
}

Author: LuffyTheFox

Likes: 5

Downloads: 0

Tags: gguf, qwen3_5, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, Dense, uncensored, not-for-all-audiences, text-generation, conversational, en, zh, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, base_model:Qwen/Qwen3.5-27B, base_model:quantized:Qwen/Qwen3.5-27B, license:apache-2.0, endpoints_compatible, region:us

alibaba-pai/AgenticQwen-8B

Model Description

AgenticQwen 8B is a small agentic language model trained on Qwen3 8B, designed for multi-step reasoning and tool use. It is trained with a multi-round reinforcement learning (GRPO-style) pipeline and a dual "data flywheel" mechanism that continually increases task difficulty for both reasoning and agentic workflows.

Author: alibaba-pai

Likes: 4

Downloads: 0

Tags: safetensors, qwen3, region:us

huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated


library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled/blob/main/LICENSE pipeline_tag: image-text-to-text base_model:

  • Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled tags:
  • abliterated
  • uncensored
  • Claude
  • reasoning
  • chain-of-thought
  • Dense

huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated

This is an uncensored version of Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled created with abliteration (see remove-refusals-with-transformers to know more about it). This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens.

ollama

Please use the latest version of ollama (v0.18.0 or later).

You can use huihui_ai/qwen3.5-abliterated:35b-Claude directly:

ollama run huihui_ai/qwen3.5-abliterated:35b-Claude

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue further development and improvement; even a cup of coffee helps.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, abliterated, uncensored, Claude, reasoning, chain-of-thought, Dense, conversational, base_model:Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, base_model:finetune:Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, license:apache-2.0, endpoints_compatible, region:us

barubary/qwen3.5-barubary-attuned-chat-template


license: apache-2.0 language:

  • en base_model:
  • Qwen/Qwen3.5-35B-A3B pipeline_tag: text-generation tags:
  • qwen
  • chat-template
  • jinja
  • qwen3.5
  • llama-cpp
  • open-webui
  • vllm
  • tool-calling
  • streaming

Qwen 3.5 Jinja Chat Template v1 attuned by Barubary

A Jinja chat template for all Qwen 3.5 models on llama.cpp, Open WebUI, vLLM, Ollama, LM Studio, and any OAI-compatible endpoint.

Created this because of some current projects related to this model.

21 fixes over the official Qwen 3.5 chat template β€” addressing bugs that are still open upstream as of March 2026.

Active Bug Reports Fixed

This template directly addresses the following community-reported bugs:

| Bug Report | Platform | Fix |
|------------|----------|-----|
| Tool calling chat template is broken | HuggingFace | Fix 6 |
| Parallel tool calls interleaving | GitHub | Fix 15 |
| KV-cache reuse breaks with enable_thinking=false | GitHub | Fix 12 |
| Cannot close thinking via enable_thinking: false | GitHub | Fix 1, 19 |
| Missing reasoning_content in Tool Calling | GitHub | Fix 13 |
| LM Studio parser breaks Qwen3.5 tool calling | Reddit | Fix 18, 19 |
| Qwen3.5 27B getting stuck in loops | Reddit | Fix 17 |
| Template problem | HuggingFace | Fix 6, 7 |

All 21 Fixes

Each fix is labeled inline in the template source (e.g., {#- FIX6 #}).

| # | Fix | What It Solves |
|---|-----|----------------|
| 1 | add_vision_id / enable_thinking safe defaults | Crashes when config vars not passed |
| 2 | Precomputed _last_idx for namespace() constructor | llama.cpp minja parser compatibility |
| 3 | Developer role handled | Claude Code / Codex / OpenCode support |
| 4 | System/developer split before main loop | Duplicate system messages |
| 5 | item.type checked before 'in item' key test | Type-check ordering bug |
| 6 | arguments.items() replaces bare key iteration | Tool calling crash (HF discussion #4) |
| 7 | \| safe filter removed | llama.cpp compatibility |
| 8 | tojson/string explicit if/else | No chained filters, prevents double-escaping |
| 9 | String arguments pass-through | OAI-compatible proxy support |
| 10 | tc alias avoids shadowing tool_call loop var | Variable scoping bug |
| 11 | ns2 namespace replaces loop.previtem / loop.nextitem | llama.cpp minja doesn't support loop helpers |
| 12 | enable_thinking applied to in-context assistant turns | KV-cache reuse bug (GitHub #1826) |
| 13 | reasoning_content is defined + not none guard | Missing reasoning_content (GitHub #26) |
| 14 | loop.index0 > (not >=) for assistant thinking scope | Off-by-one in thinking block placement |
| 15 | Parallel tool calls: \n\n delimiter between blocks | Parallel tool call interleaving (GitHub #7117) |
| 16 | Long tool args/responses: configurable truncation guard | Context overflow from massive tool outputs |
| 17 | Deep agent loops: graceful fallback to index 0 | Agent loops crashing after 5+ hops |
| 18 | Streaming compat: clean newline boundaries on all XML tags | LM Studio parser breaks (Reddit) |
| 19 | Auto-disable thinking when tools active | <tool_call> leaks into <think> blocks |
| 20 | Unknown roles: graceful fallback mapped to user role | Planner/critic/custom roles crash |
| 21 | Flattened nesting depth; _has_tools precomputed | llama.cpp minja stability |

Feature Comparison

| Feature | This Template | Official Qwen | Unsloth | bartowski |
|---------|:-------------:|:-------------:|:-------:|:---------:|
| Parallel tool call separation | βœ… | ❌ | ❌ | ❌ |
| Auto-disable thinking with tools | βœ… | ❌ | ❌ | ❌ |
| Deep agent loop fallback | βœ… | ❌ | ❌ | ❌ |
| Unknown role graceful fallback | βœ… | ❌ | ❌ | ❌ |
| Configurable truncation guards | βœ… | ❌ | ❌ | ❌ |
| Streaming-safe XML boundaries | βœ… | ❌ | ❌ | partial |
| Developer role support | βœ… | ❌ | ❌ | ❌ |
| arguments.items() fix | βœ… | ❌ | βœ… | ❌ |
| reasoning_content guard | βœ… | ❌ | partial | ❌ |

Usage

llama-server (llama.cpp)

llama-server \
  -m Qwen3.5-35B-A3B-*.gguf \
  --jinja -fa \
  --chat-template-file chat_template.jinja \
  -c 32768 -ngl 99 \
  --temp 0.6 --top-k 20 --top-p 0.8 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080

Open WebUI

Mount the template via Docker:

volumes:
  - ./chat_template.jinja:/templates/chat_template.jinja:ro
command: >
  --chat-template-file /templates/chat_template.jinja

vLLM

vllm serve Qwen/Qwen3.5-35B-A3B \
  --chat-template ./chat_template.jinja

Ollama

Copy chat_template.jinja into your Modelfile or use with a compatible frontend.

Configuration

Pass via --chat-template-kwargs:

{
  "enable_thinking": true,
  "auto_disable_thinking_with_tools": true,
  "add_vision_id": false,
  "max_tool_arg_chars": 0,
  "max_tool_response_chars": 8192
}

| Variable | Default | Description |
|----------|---------|-------------|
| enable_thinking | true | Controls <think> mode |
| auto_disable_thinking_with_tools | true | Auto-disables thinking when tools are provided to prevent <tool_call> bleed into <think> blocks |
| add_vision_id | false | Prefix images/videos with "Picture N:" / "Video N:" |
| max_tool_arg_chars | 0 (unlimited) | Truncate tool arguments beyond this length |
| max_tool_response_chars | 0 (unlimited) | Truncate tool responses beyond this length |

Before / After

Tool Call Bleed Bug (Fix 19)

Before (official template):

<think>
The user wants me to search for...
<tool_call>          ← WRONG: tool call inside think block
<function=search>

After (this template):

<think>

</think>

<tool_call>          ← Correct: thinking auto-disabled when tools present
<function=search>

Parallel Tool Calls (Fix 15)

Before (official template):

<tool_call><function=multiply>...</function></tool_call><tool_call><function=add>...</function></tool_call>

After (this template):

<tool_call>
<function=multiply>
...
</function>
</tool_call>

<tool_call>
<function=add>
...
</function>
</tool_call>
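The fix boils down to rendering each tool call as its own newline-delimited block and joining blocks with a blank line. A Python sketch of that serialization logic (the function names are illustrative, not the template's actual Jinja macros):

```python
def render_tool_call(name: str, args: str) -> str:
    """Render one tool call with newline boundaries on every XML tag (Fix 18)."""
    return f"<tool_call>\n<function={name}>\n{args}\n</function>\n</tool_call>"

def render_parallel(calls):
    """Join parallel tool calls with a blank line between blocks (Fix 15)."""
    return "\n\n".join(render_tool_call(name, args) for name, args in calls)

out = render_parallel([("multiply", "..."), ("add", "...")])
print(out)
```

The blank-line delimiter is what lets streaming parsers (e.g. the LM Studio case above) detect where one tool-call block ends and the next begins.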

Compatible Models

Tested and compatible with all Qwen 3.5 models:

  • Qwen3.5-35B-A3B (all quants)
  • Qwen3.5-27B-A3B
  • Qwen3.5-14B-A3B
  • Qwen3.5-9B
  • Qwen3.5-4B
  • Qwen3.5-Coder series

Also backward-compatible with Qwen3 32B.

Tested Platforms

  • βœ… llama.cpp (b4242+)
  • βœ… Open WebUI (v0.4.8+)
  • βœ… vLLM (v0.6.4+)
  • βœ… Ollama (v0.5.0+)
  • βœ… LM Studio (v0.3.5+)
  • βœ… Text Generation WebUI (oobabooga)

Credits

Base template architecture: Qwen team (official Qwen3.5 chat template)

All 21 fixes: Barubary (original implementations)

License

Apache 2.0 (same as the official Qwen3.5 template)

Contributing

Found a bug? Open an issue with:

  • Minimal reproduction case
  • Error logs
  • Model and runtime versions

Pull requests welcome.

Author: barubary

Likes: 3

Downloads: 0

Tags: qwen, chat-template, jinja, qwen3.5, llama-cpp, open-webui, vllm, tool-calling, streaming, text-generation, conversational, en, base_model:Qwen/Qwen3.5-35B-A3B, base_model:finetune:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, region:us

alibaba-pai/AgenticQwen-30B-A3B

Author: alibaba-pai

Likes: 3

Downloads: 0

Tags: region:us