Today's AI Summary

AI Developments: Personalized Agents, Enhanced Video Generation, and Model Safety

Here's a look at the latest AI advancements, covering personalized language models, improvements in video generation, and techniques for model safety.

Research Highlights

  • PersonaAgent with GraphRAG: A new framework for personalized AI agents is introduced, leveraging knowledge graphs to capture user preferences and historical behaviors. By combining user-specific summaries with global interaction patterns, the agent can maintain consistent, persona-aligned behaviors. The method shows significant improvements on the LaMP benchmark, enhancing news categorization, movie tagging, and product rating accuracy.
  • Planning with Sketch-Guided Verification for Physics-Aware Video Generation: A novel training-free framework called SketchVerify improves motion planning quality in video generation. It uses a vision-language verifier to evaluate the semantic alignment and physical plausibility of candidate motion plans, rendering trajectories as lightweight video sketches for efficient scoring. Experiments on WorldModelBench and PhyWorldBench demonstrate improved motion quality, physical realism, and long-term consistency.
  • Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards: The paper introduces MR-RLVR, a method that constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Implemented on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8.

Model Releases

  • ArliAI/GLM-4.5-Air-Derestricted: Arli AI has released a "derestricted" version of GLM-4.5-Air, aiming to remove refusal behaviors while maintaining high-performance reasoning. The model utilizes Norm-Preserving Biprojected Abliteration, a technique that preserves the magnitude of neurons during the uncensoring process, potentially improving reasoning capabilities.
  • Phr00t/HunyuanVideo-1.5-Rapid-AIO: This model combines HunyuanVideo 1.5 with VAE and accelerators for faster and easier video generation. It supports both text-to-video (T2V) and image-to-video (I2V) tasks in a single model, using FP8 precision for efficiency.
  • lightx2v/Hy1.5-Distill-Models: This repository contains 4-step distilled models for HunyuanVideo-1.5, optimized for use with LightX2V. These distilled models enable ultra-fast 4-step inference without CFG (Classifier-Free Guidance), significantly reducing generation time while maintaining high-quality video output.

Key Takeaways

  • Personalization: AI agents are becoming more personalized through the use of knowledge graphs and user-specific data.
  • Video Generation: Advances in video generation focus on improving motion quality, physical realism, and efficiency through innovative planning and verification techniques.
  • Model Safety: New methods like Norm-Preserving Biprojected Abliteration aim to remove unwanted behaviors from language models without sacrificing their reasoning capabilities.

AI Papers for 2026-02-18

Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization

Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models -- long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench
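
A rough way to see the "attention dilution" intuition (a hedged sketch, not the paper's formal analysis): with bounded attention logits, the softmax weight a fixed-capacity Transformer can place on any single relevant token shrinks as the context length n grows.

% Softmax attention over n tokens with logits bounded by |s_i| <= B.
% The weight on a relevant token j satisfies
\alpha_j \;=\; \frac{e^{s_j}}{\sum_{i=1}^{n} e^{s_i}}
\;\le\; \frac{e^{B}}{e^{B} + (n-1)e^{-B}}
\;=\; \frac{1}{1 + (n-1)e^{-2B}} \;\xrightarrow[\,n \to \infty\,]{} 0,
% so the attention mass available to personalization- or privacy-relevant
% tokens decays roughly like e^{2B}/n as the context scales.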

Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improve training efficiency. We instantiate the framework for molecular graph generation under $S_n \times SE(3)$ symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even less computation. Moreover, with a novel architecture Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.
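
A minimal sketch of the canonicalize-then-randomize recipe for the SE(3) part only (hypothetical helper names; permutation canonicalization, reflection handling, and the actual diffusion/flow model are omitted):

import numpy as np

def canonicalize(x):
    """Map a point cloud to an orbit representative: remove translation and
    rotate it into a canonical pose given by its principal axes."""
    x = x - x.mean(axis=0)                      # remove translation
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt.T                             # align principal axes with coordinate axes

def random_rotation(rng):
    """Sample a (roughly uniform) random 3D rotation via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))                 # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1                           # ensure det = +1 (a proper rotation)
    return q

# Training: fit any unconstrained generative model on canonicalize(x) samples.
# Sampling: draw a canonical sample, then re-randomize its pose to recover invariance.
rng = np.random.default_rng(0)
x_canonical = rng.normal(size=(16, 3))          # stand-in for a model sample
x_generated = x_canonical @ random_rotation(rng).T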

Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search & Evaluation

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests >85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

Cold-Start Personalization via Training-Free Priors from Structured World Models

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
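
A hedged toy illustration of the offline-prior + training-free online inference split (not the paper's Pep implementation; the profile table, dimensions, and the simulated answer are all made up):

import numpy as np

# Offline "structured world model": complete preference profiles observed in training data,
# over 4 binary preference dimensions.
profiles = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
weights = np.ones(len(profiles))                  # belief weight on each candidate profile

def bernoulli_entropy(p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for _ in range(2):                                # limited question budget
    marginals = weights @ profiles / weights.sum()
    dim = int(np.argmax(bernoulli_entropy(marginals)))   # ask about the most uncertain dimension
    answer = 1                                    # stand-in for the user's reply
    weights = weights * (profiles[:, dim] == answer)     # exact Bayesian update against the prior
    predicted = weights @ profiles / weights.sum()
    print(f"asked dim {dim}; predicted full profile: {predicted}")

Because the update is made against correlated complete profiles, answering one question also moves the predictions for dimensions that were never asked about, which is the behavior the abstract highlights.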

Spectral Convolution on Orbifolds for Geometric Deep Learning

Geometric deep learning (GDL) deals with supervised learning on data domains that go beyond Euclidean structure, such as data with graph or manifold structure. Due to the demand that arises from application-related data, there is a need to identify further topological and geometric structures with which these use cases can be made accessible to machine learning. There are various techniques, such as spectral convolution, that form the basic building blocks for some convolutional neural network-like architectures on non-Euclidean data. In this paper, the concept of spectral convolution on orbifolds is introduced. This provides a building block for making learning on orbifold structured data accessible using GDL. The theory discussed is illustrated using an example from music theory.
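
For context, a minimal sketch of ordinary spectral convolution on a graph, the building block that the paper generalizes to orbifold-structured data (the orbifold construction itself is not shown; the adjacency matrix, signal, and filter are illustrative):

import numpy as np

def spectral_conv(adjacency, signal, spectral_filter):
    """Filter a node signal in the eigenbasis of the graph Laplacian."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)          # "Fourier basis" of the graph
    coeffs = eigvecs.T @ signal                           # graph Fourier transform
    return eigvecs @ (spectral_filter(eigvals) * coeffs)  # filter, then transform back

adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
signal = np.array([1.0, 0.0, -1.0])
print(spectral_conv(adjacency, signal, lambda lam: np.exp(-lam)))  # heat-kernel low-pass filter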

On the Semantics of Primary Cause in Hybrid Dynamic Domains

Reasoning about actual causes of observed effects is fundamental to the study of rationality. This important problem has been studied since the time of Aristotle, with formal mathematical accounts emerging recently. We live in a world where change due to actions can be both discrete and continuous, that is, hybrid. Yet, despite extensive research on actual causation, only a few recent studies have looked into causation with continuous change. Building on recent progress, in this paper we propose two definitions of primary cause in a hybrid action-theoretic framework, namely the hybrid temporal situation calculus. One of these is foundational in nature while the other formalizes causation through contributions, which can then be verified from a counterfactual perspective using a modified "but-for" test. We prove that these two definitions are indeed equivalent. We then show that our definitions of causation have some intuitively justifiable properties.

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.

PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design

To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC-AMP, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.

AI Models

xgen-universe/Capybara


license: mit


CAPYBARA: A Unified Visual Creation Model

Links: Models (https://huggingface.co/xgen-universe/Capybara) · Code (https://github.com/xgen-universe/Capybara) · Gradio Demo (https://inappetent-acrophonically-alison.ngrok-free.dev/) · Technical Report (https://github.com/xgen-universe/Capybara/blob/main/assets/docs/tech_report.pdf)

Capybara is a unified visual creation model, i.e., a powerful visual generation and editing framework designed for high-quality visual synthesis and manipulation tasks.

The framework leverages advanced diffusion models and transformer architectures to support versatile visual generation and editing capabilities with precise control over content, motion, and camera movements.

[Demo video (https://pub-7ba77a763b8142cea73b9e48b46830ca.r2.dev/demo_video.mp4): speech-driven base clips generated by Seedance 2.0; all editing powered by CAPYBARA.]

Key Features:

  • 🎬 Multi-Task Support: Supports Text-to-Video (T2V), Text-to-Image (T2I), Instruction-based Video-to-Video (TV2V), Instruction-based Image-to-Image (TI2I), and various editing tasks
  • 🚀 High Performance: Built with distributed inference support for efficient multi-GPU processing

🔄 News

  • [2026.02.17] 🚀 Initial release v0.1 of the Capybara inference framework supporting generation and instruction-based editing tasks (T2I, T2V, TI2I, TV2V).

📝 TODO List

  • [ ] Release our unified creation model.
  • [ ] Release training code.

🛠️ Installation

We recommend using Anaconda to create an isolated Python environment, and we recommend CUDA 12.6:

# Clone the repository
git clone https://github.com/xgen-universe/Capybara.git
cd Capybara

# Create environment
conda create -n capybara python=3.11 -y
conda activate capybara

# Install pytorch (torch 2.6.0 with CUDA 12.6)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# Install dependencies
pip install -r requirements.txt

# [Optional] Install Flash Attention for faster inference
pip install flash_attn --no-build-isolation

📄 Model Download

Capybara requires the following model components:

| Models | Download Link | Description |
| ------------- | ----------------- | -------------------------------------------- |
| Model | Huggingface Model | All Mandatory Models |
| Rewrite Model | Huggingface Model | Qwen3-VL-8B-Instruct for Rewrite Instruction |

Download the required models and organize them in the following structure:

ckpts/
├── scheduler/
│   └── scheduler_config.json
├── text_encoder/
│   ├── byt5-small/
│   ├── Glyph-SDXL-v2/
│   └── llm/
├── transformer/
│   └── capybara_v01/
├── vae/
└── vision_encoder/
    └── siglip/

🚀 Inference & Quick Start

Capybara supports two inference modes: Single Sample Mode for quick testing with a single input, and Batch Mode for processing multiple samples via CSV files. Both modes support all task types.

We provide example scripts under script/ and example data under assets/ to help you get started quickly:

assets/
├── examples/           # Example media files
│   ├── img1.jpeg
│   ├── img2.jpeg
│   ├── video1.mp4
│   └── video2.mp4
└── test_data/          # Example CSV files for batch mode
    ├── ti2i_example.csv
    └── tv2v_example.csv

Single Sample Mode

Process a single image or video with a text prompt. See script/test_single_infer.sh for full examples.

Instruction-based Image-to-Image (TI2I):

python inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --media_path ./assets/examples/img1.jpeg \
    --prompt "Change the time to night." \
    --output_path ./results/test_single_output/ti2i \
    --num_inference_steps 50 \
    --task_type ti2i \
    --resolution 720p \
    --rewrite_instruction

Instruction-based Video-to-Video (TV2V):

python inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --media_path ./assets/examples/video1.mp4 \
    --prompt "Replace the monkey with Ultraman. keep the Ultraman's motion matched the original running pose and motion of monkey." \
    --output_path ./results/test_single_output/tv2v \
    --num_inference_steps 50 \
    --num_frames 81 \
    --task_type tv2v \
    --resolution 480p \
    --rewrite_instruction
<details> <summary> More inference examples for generation tasks (T2I/T2V)</summary>

Text-to-Video (T2V):

python inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --prompt "A giant humpback whale and its calf gracefully swim in the crystal-clear, deep blue open ocean." \
    --output_path ./results/test_single_output/t2v \
    --guidance_scale 4 \
    --num_inference_steps 50 \
    --num_frames 81 \
    --task_type t2v \
    --resolution 480p \
    --aspect_ratio "16:9" \
    --rewrite_instruction

Text-to-Image (T2I):

python inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --prompt "A group of five hikers, sitting on the snow mountain." \
    --output_path ./results/test_single_output/t2i \
    --guidance_scale 4 \
    --num_inference_steps 50 \
    --task_type t2i \
    --resolution 720p \
    --aspect_ratio "16:9" \
    --rewrite_instruction
</details>

Batch Mode (CSV)

Process multiple samples using a CSV file. See script/test_infer.sh for a full example.

CSV Format

For editing tasks (TI2I / TV2V), prepare a CSV with img_path/video_path and instruction columns:

img_path,instruction
img1.jpeg,instruction1.
img2.jpeg,instruction2.

The path column holds relative paths to media files (images or videos) under the data root directory.

Example CSV files are provided in assets/test_data/.

Single GPU

python inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --csv_path ./assets/test_data/ti2i_example.csv \
    --data_root_path ./assets/examples \
    --output_path ./results/test_output/ti2i-480p \
    --num_inference_steps 50 \
    --num_frames 81 \
    --task_type ti2i \
    --resolution 720p \
    --rewrite_instruction

Multi-GPU (Distributed)

Use accelerate for distributed inference across multiple GPUs:

accelerate launch --config_file acc_config/accelerate_config.yaml --num_processes 2 inference.py \
    --pretrained_model_name_or_path ./ckpts \
    --csv_path ./assets/test_data/ti2i_example.csv \
    --data_root_path ./assets/examples \
    --output_path ./results/test_output/ti2i-480p \
    --num_inference_steps 50 \
    --num_frames 81 \
    --task_type ti2i \
    --resolution 720p \
    --rewrite_instruction

⚙️ Configuration Details

Task Types

| Task Type | Description | Input Required |
| --------- | ---------------------------------------- | -------------------------------- |
| t2v | Text-to-Video generation | --prompt |
| t2i | Text-to-Image generation | --prompt |
| ti2i | Instruction-based Image-to-Image editing | --media_path + --prompt (or CSV) |
| tv2v | Instruction-based Video-to-Video editing | --media_path + --prompt (or CSV) |

Key Parameters

| Parameter | Default | Description |
| -------------------------------- | ------------------------- | -------------------------------------------------- |
| --pretrained_model_name_or_path | (required) | Path to the model checkpoint directory |
| --task_type | tv2v | Task type: t2v, t2i, ti2i, tv2v |
| --resolution | None | Output resolution: 480p, 720p, 1080p |
| --aspect_ratio | None | Aspect ratio: 16:9, 9:16, 4:3, 3:4, 1:1 |
| --num_frames | 81 | Number of frames to generate (e.g., 81, 101, 121) |
| --num_inference_steps | 50 | Number of denoising steps |
| --guidance_scale | 1.0 | Text guidance scale for classifier-free guidance |
| --num_sample_per_case | 1 | Number of samples to generate per input |
| --rewrite_instruction | False | Auto-enhance prompts using Qwen3-VL-8B-Instruct |
| --rewrite_model_path | Qwen/Qwen3-VL-8B-Instruct | Path to the rewrite model |
| --max_samples | None | Limit the number of samples to process from CSV |

Recommended Settings

For optimal quality and performance, we recommend the following settings:

| Task Type | Recommended Resolution | Recommended Steps | Note |
| ----------------- | ---------------------- | ----------------- | ------------------------------------- |
| Video (T2V, TV2V) | 480p | 50 | Balanced quality and generation speed |
| Image (T2I, TI2I) | 720p | 50 | Higher quality for static images |

Notes:

  • Resolution: You can experiment with higher resolutions (1024 or 1080p).
  • Inference Steps: 50 steps provide a good balance between quality and speed. You can use 30-40 steps for faster generation.

📄 License

This project is released under the MIT License.

🙏 Acknowledgments

This project is built upon:

📝 Citation

If you find Capybara useful for your research, please consider citing:

@misc{capybara2026rao,
  title={Capybara: A Unified Visual Creation Model},
  author={Rao, Zhefan and Che, Haoxuan and Hu, Ziwen and Zou, Bin and Liu, Yaofang and He, Xuanhua and Choi, Chong-Hou and He, Yuyang and Chen, Haoyu and Su, Jingran and Li, Yanheng and Chu, Meng and Lei, Chenyang and Zhao, Guanhua and Li, Zhaoqing and Zhang, Xichen and Li, Anping and Liu, Lin and Tu, Dandan and Liu, Rui},
  year={2026}
}

📧 Contact

For questions and feedback, please open an issue on GitHub.

You can also contact us by email: zraoac@ust.hk and hche@ust.hk


⭐ If you find this project helpful, please consider giving it a star!

Author: xgen-universe

Likes: 40

Downloads: 0

Tags: diffusers, safetensors, license:mit, region:us

AesSedai/Qwen3.5-397B-A17B-GGUF


base_model:

  • Qwen/Qwen3.5-397B-A17B

This repo contains specialized MoE quants for Qwen3.5-397B-A17B. The idea is that, given the huge size of the FFN tensors compared to the rest of the tensors in the model, it should be possible to achieve better quality while keeping the overall model smaller than a comparable naive quantization. To that end, the default quantization type is kept at high quality, while the FFN UP + FFN GATE tensors are quantized down along with the FFN DOWN tensors.
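
To see why the FFN expert tensors dominate the size budget, here is back-of-the-envelope arithmetic (a rough estimate, not an exact accounting) using the architecture figures quoted in the Qwen3.5-397B-A17B-NVFP4 card further down: 512 experts per layer, hidden size 4096, expert intermediate size 1024, 60 layers.

# Rough parameter accounting for Qwen3.5-397B-A17B.
hidden, expert_inter, n_layers, n_experts = 4096, 1024, 60, 512

# Each expert has up/gate/down projections of roughly hidden x expert_inter weights each.
expert_params = n_layers * n_experts * 3 * hidden * expert_inter
total_params = 397e9

print(f"expert FFN parameters: ~{expert_params / 1e9:.0f}B of ~{total_params / 1e9:.0f}B "
      f"({100 * expert_params / total_params:.0f}% of the model)")
# -> roughly 386B of 397B, i.e. ~97%: quantizing the expert tensors harder while keeping
#    everything else at higher precision is where nearly all of the size savings come from.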

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| :----- | :-------------------- | :---------------------------- | :------------------ | :------------------------ | :------------------ |
| Q5_K_M | 273.49 GiB (5.93 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 4.617400 ± 0.057235 | +0.0156% | 0.002553 ± 0.000078 |
| Q4_K_M | 227.55 GiB (4.93 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 4.624688 ± 0.057341 | +0.1735% | 0.004496 ± 0.000117 |
| IQ4_XS | 176.92 GiB (3.83 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 4.653226 ± 0.057738 | +0.7916% | 0.011963 ± 0.000309 |
| IQ3_S | 136.31 GiB (2.95 BPW) | Q6_K / IQ2_S / IQ2_S / IQ3_S | 4.745153 ± 0.059208 | +2.7828% | 0.033163 ± 0.000791 |

[KLD and perplexity comparison graphs]

Author: AesSedai

Likes: 7

Downloads: 0

Tags: gguf, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, endpoints_compatible, region:us, imatrix, conversational

vincentzed-hf/Qwen3.5-397B-A17B-NVFP4


pipeline_tag: image-text-to-text
license: apache-2.0
library_name: Model Optimizer
base_model:

  • Qwen/Qwen3.5-397B-A17B

tags:
  • nvidia
  • ModelOpt
  • Qwen3.5
  • quantized
  • NVFP4
  • nvfp4
  • multimodal
  • vision-language

NVIDIA Qwen3.5-397B-A17B-NVFP4 Model Card

Model Overview

Description:

The NVIDIA Qwen3.5-397B-A17B-NVFP4 model is a quantized version of Qwen's Qwen3.5-397B-A17B model, an autoregressive multimodal language model that uses an optimized Transformer architecture with Mixture of Experts (MoE) and vision-language capabilities. For more information, refer to the Qwen3.5-397B-A17B model card. The NVIDIA Qwen3.5-397B-A17B-NVFP4 model was quantized using the TensorRT Model Optimizer.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA (Qwen3.5-397B-A17B) Model Card.

License/Terms of Use:

Apache 2.0

Deployment Geography:

Global

Use Case:

Developers looking to take off-the-shelf, pre-quantized models for deployment in AI agent systems, chatbots, RAG systems, and other AI-powered applications.

Release Date:

Hugging Face, via https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4

Model Architecture:

Architecture Type: Transformers (Hybrid)
Network Architecture: Qwen3_5MoeForConditionalGeneration
Model Details:

  • Total Parameters: 397B
  • Active Parameters: 17B (Sparse Mixture-of-Experts)
  • Hidden Size: 4096
  • Number of Layers: 60 (15 blocks of 3x Gated DeltaNet + 1x Gated Attention, each followed by MoE)
  • Expert Configuration: 512 total experts, 10 activated per token + 1 shared expert, expert intermediate dimension 1024.
  • Gated DeltaNet (Linear Attention): 64 value heads, 16 QK heads, head dimension 128. Provides long-context efficiency.
  • Gated Attention (Full Attention): 32 Q heads, 2 KV heads, head dimension 256, with 64-dim rotary position embedding.
  • Context Window: 262,144 tokens (native), extensible to 1,010,000 tokens with YaRN scaling.
  • Vocabulary Size: 248,320
  • Multi-Token Prediction (MTP): 1 additional prediction layer for speculative decoding.
  • Thinking Mode: Default behavior with <think>...</think> blocks before responses.
  • Multilingual: Supports 201 languages and dialects.
  • Vision Encoder: 27-layer ViT with 1152 hidden size, patch size 16, spatial merge size 2.

Input:

Input Type(s): Text, Image, Video
Input Format(s): String, Image, Video
Input Parameters: 1D (One-Dimensional): Sequences; 2D (Two-Dimensional): Images; 3D (Three-Dimensional): Video

Output:

Output Type(s): Text
Output Format: String
Output Parameters: 1D (One-Dimensional): Sequences
Other Properties Related to Output: N/A

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • SGLang

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Blackwell (B300, B200, RTX PRO 6000 Blackwell)

Preferred Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

The model is quantized with nvidia-modelopt 0.42.0rc1.dev21+g421985313

Training, Testing, and Evaluation Datasets:

Calibration Dataset:

Training Datasets:

  • Data Collection Method by Dataset: Undisclosed
  • Labeling Method by Dataset: Undisclosed
  • Properties: Undisclosed

Testing Dataset:

  • Data Collection Method by Dataset: Undisclosed
  • Labeling Method by Dataset: Undisclosed
  • Properties: Undisclosed

Evaluation Dataset:

  • Data collection method: Hybrid: Automated, Human
  • Labeling method: Hybrid: Human, Automated

Inference:

Acceleration Engine: SGLang
Test Hardware: B300

Recommended Hardware

| GPU | Architecture | VRAM | Memory Type | Memory Bandwidth | TDP |
| :--- | :--- | :--- | :--- | :--- | :--- |
| NVIDIA B300 (SXM) | Blackwell Ultra (GB110) | 288 GB HBM3e | HBM3e | 4.1 TB/s | 1400 W |
| NVIDIA B200 (SXM) | Blackwell (GB100) | 192 GB HBM3e | HBM3e | 4.1 TB/s | 1000 W |
| NVIDIA RTX PRO 6000 Blackwell (PCIe) | Blackwell (GB202) | 96 GB GDDR7 | GDDR7 | 1.8 TB/s | 600 W |

B300: 288 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 2032 MHz boost. Datacenter SXM module.
B200: 192 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 1965 MHz boost. Datacenter SXM module.
RTX PRO 6000 Blackwell: 96 GB GDDR7 per GPU, 512-bit memory bus, 24,064 CUDA cores, 752 Tensor Cores, up to 2617 MHz boost. Professional workstation GPU (PCIe 5.0 x16).

Post Training Quantization

This model was obtained by quantizing the weights and activations of Qwen3.5-397B-A17B to the NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within transformer blocks are quantized, and the KV cache is quantized to FP8. Vision encoder weights are not quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 4x.
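
As a quick sanity check on the quoted sizes (back-of-the-envelope arithmetic, not an official figure):

# 397B parameters at BF16 (2 bytes) vs NVFP4 (~0.5 bytes) per weight.
params = 397e9
bf16_gib = params * 2 / 2**30     # ~739 GiB in BF16
nvfp4_gib = params * 0.5 / 2**30  # ~185 GiB of raw 4-bit weights
print(f"BF16: ~{bf16_gib:.0f} GiB, NVFP4 weights: ~{nvfp4_gib:.0f} GiB")
# Per-group scales, unquantized vision/embedding tensors, and metadata bring the shipped
# checkpoint to roughly 224 GB, consistent with the ~4x reduction noted above.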

Usage

Deploy with SGLang

The total quantized checkpoint size is ~224GB (397B total parameters, 17B active). On NVIDIA Blackwell GPUs:

| Configuration | GPUs | VRAM per GPU | Total VRAM | Throughput |
|:---|:---|:---|:---|:---|
| B300 TP=4 | 4x B300 | 288 GB | 1,152 GB | ~120 tok/s |
| B300 TP=8 | 8x B300 | 288 GB | 2,304 GB | - |
| B200 TP=4 | 4x B200 | 192 GB | 768 GB | - |
| B200 TP=8 | 8x B200 | 192 GB | 1,536 GB | - |
| RTX PRO 6000 TP=4 | 4x RTX PRO 6000 | 96 GB | 384 GB | - |
| RTX PRO 6000 TP=8 | 8x RTX PRO 6000 | 96 GB | 768 GB | - |

The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

# TP=4 (recommended, ~120 tok/s on 4x B300)
python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

# TP=8 (if you have less VRAM per GPU, e.g. RTX PRO 6000)
python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 8 \
    --context-length 262144 \
    --reasoning-parser qwen3

Speculative Decoding (Experimental)

Qwen3.5 includes a built-in Multi-Token Prediction (MTP) head that can be used for speculative decoding via the NEXTN algorithm. This is experimental and may or may not work at this time:

python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 8 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Installation

Important: You must install SGLang from the bzhng-development:vz/qwen3-5 branch, which includes a fix that correctly excludes the vision encoder weights from quantization during inference. Without this fix, the visual weights may be handled incorrectly.

git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0

When a release is cut with this fix, we will update this model card.

Reproduce with ModelOpt

You may want to produce this checkpoint yourself. To reproduce the NVFP4 quantized checkpoint using TensorRT Model Optimizer:

python3 examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.5-397B-A17B \
    --qformat nvfp4 \
    --export_path ./qwen3-5-nvfp4

Note: NVFP4 weights and the FP8 KV cache provide a significant memory footprint reduction (~3.5x vs BF16) with negligible accuracy degradation.

Baseline: Qwen3.5-397B-A17B.

Model Limitations:

The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. The model may therefore amplify those biases and return toxic responses, especially when given toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself contains nothing explicitly offensive.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Author: vincentzed-hf

Likes: 5

Downloads: 0

Tags: Model Optimizer, safetensors, qwen3_5_moe, nvidia, ModelOpt, Qwen3.5, quantized, NVFP4, nvfp4, multimodal, vision-language, image-text-to-text, conversational, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, modelopt, region:us

mratsim/MiniMax-M2.5-FP8-INT4-AWQ


pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
library_name: llm-compressor
tags:

  • fp8
  • awq
  • conversational
  • vllm
  • code
  • devops
  • software engineering
  • engineer
  • developer
  • architect
  • stem
  • agent

datasets:
  • HuggingFaceH4/ultrachat_200k
  • databricks/databricks-dolly-15k
  • neuralmagic/calibration
  • HuggingFaceH4/no_robots
  • nvidia/HelpSteer
  • garage-bAInd/Open-Platypus
  • PJMixers/grimulkan_physical-reasoning-ShareGPT
  • PJMixers/grimulkan_theory-of-mind-ShareGPT
  • HuggingFaceH4/Multilingual-Thinking
  • ServiceNow-AI/M2Lingual
  • droussis/euroblocks_sft_1sample_per_lang
  • interstellarninja/hermes_reasoning_tool_use
  • deepmind/code_contests
  • dh02391735/stackoverflow-kubernetes-questions
  • diversoailab/humaneval-rust
  • ammarnasr/the-stack-rust-clean
  • CSJianYang/CodeArena
  • nvidia/OpenCodeInstruct
  • nvidia/Llama-Nemotron-Post-Training-Dataset
  • nvidia/Nemotron-Competitive-Programming-v1
  • rombodawg/code_bagel_hermes-2.5
  • MathArena/project_euler
  • nvidia/Nemotron-Math-Proofs-v1
  • nvidia/OpenMathInstruct-2
  • nvidia/OpenScienceReasoning-2
  • MegaScience/MegaScience
  • OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
  • ccdv/pubmed-summarization
  • gbharti/finance-alpaca
  • vladlen32230/summarization-yahoo-stock-finance-article-text
  • fka/awesome-chatgpt-prompts
  • theoldmandthesea/17k_business_book
  • ruggsea/stanford-encyclopedia-of-philosophy_instruct
  • mlfoundations-dev/stackexchange_philosophy
  • FreedomIntelligence/SocraticChat
  • Gryphe/Opus-WritingPrompts
  • anthracite-org/nopm_claude_writing_fixed
  • zerofata/Roleplay-Anime-Characters
  • zerofata/Instruct-Anime
  • zerofata/Instruct-Anime-CreativeWriting
  • sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
  • PocketDoc/Dans-Prosemaxx-Adventure
  • anthracite-org/stheno-filtered-v1.1
  • KaraKaraWitch/TvTroper-2025
  • AquaV/US-Army-Survival-Sharegpt
  • AquaV/Interrogation-Sharegpt
  • AquaV/Multi-Environment-Operations-Sharegpt
  • AquaV/Resistance-Sharegpt
  • PocketDoc/Dans-Kinomaxx-VanillaBackrooms

base_model:
  • MiniMaxAI/MiniMax-M2.5

MiniMax M2.5 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)

This strives to be the highest-quality quant that can run in 192 GiB of VRAM.

[!TIP]
💡 A non-FP8 version is available at mratsim/MiniMax-M2.5-BF16-INT4-AWQ.
That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM.
This FP8+INT4 AWQ was built by merging the original FP8 self-attention weights with the mratsim/MiniMax-M2.5-BF16-INT4-AWQ experts.

It features:

  • This quant ensures that all experts are calibrated; not doing so is extremely detrimental (PR: https://github.com/vllm-project/llm-compressor/pull/2171).

    <details> <summary>💡**[Click me!]** Visual showcase of why ensuring quantization of all MoE experts is important</summary>
    • Source: https://avtc.github.io/aquarium-side-by-side/
    • Context: https://github.com/ModelCloud/GPTQModel/pull/2235

    [side-by-side comparison image]

    </details>
  • Mixed precision with:

    • self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
    • experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
  • High-quality large and diverse dataset with programming and devops focus as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.

  • Calibration explicitly tests multilingual capabilities:

    • Asia: Chinese, Hindi, Korean, Japanese
    • Europe: French, German, Portuguese, Russian, Spanish
    • Middle-East: Arabic, Hebrew, Turkish
  • Calibration explicitly tests 60 programming languages and not just Python:

    • Imperative programming: C, C++, Go, Zig, ...
    • Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
    • Web-focused: HTML/CSS, Typescript, PHP, ...
    • Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
    • Theorem provers: Coq, Lean
    • Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
    • GPU Programming: Cuda, Vulkan, Apple Metal
    • Game Programming: GDScript, GLSL
    • Domain-specific: MATLAB, Julia, Solidity, R
  • Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)

  • Built by a dev, for devs (and it looks very good for STEM as well)

It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml

<details> <summary>This has taken several days of work, contributions, and bug reports to the ecosystem; I hope you find it useful.</summary>
  • https://github.com/vllm-project/llm-compressor/pull/2171
  • https://github.com/vllm-project/llm-compressor/issues/2172
  • https://github.com/vllm-project/vllm/issues/31623
  • https://github.com/sgl-project/sglang/issues/16276
  • https://github.com/sgl-project/sglang/issues/16295
</details>

📄 Usage & Running Instructions

The model was tested with vLLM + 2x RTX Pro 6000; here is a script suitable for that configuration with the maximum 196,608-token context length. This uses 92.5 GiB of VRAM with the flashinfer backend.

[!WARNING]
⚠️ Due to the rope_parameters change, this model is currently incompatible with transformers V5.
This makes it incompatible with GLM-4.6V, which requires transformers V5. Use different Docker images.

[!WARNING]
⚠️ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.

Please use mratsim/MiniMax-M2.5-BF16-INT4-AWQ in the meantime.

Running script

--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028

You have two reasoning parsers:

  • minimax_m2 puts the reasoning content in a special field, like DeepSeek models, which frontends usually render in a specific manner.
  • minimax_m2_append_think puts the reasoning into <think>reasoning_content</think>, which is sent as normal text. Few frontends render that properly; I'm aware of Cherry Studio on Desktop and ChatterUI on Android.

minimax_m2_append_think was introduced for Interleaved Thinking, so the model can build upon its previous thinking (frontends usually discard the thinking trace).

[!TIP]
💡 In MiniMax-M2.1, with the recommended parameters, the model tended to get stuck in repetition loops in vLLM.
It seemed like repetition_penalty: 1.10, frequency_penalty: 0.40 avoided that.

You may want to try the recommended settings without repetition_penalty first (it also slows down token generation).

# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.5-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.5"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --trust-remote-code \
  --gpu-memory-utilization ${GPU_UTIL} \
  --tp 2 \
  --override-generation-config "${SAMPLER_OVERRIDE}" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2
  # --reasoning-parser minimax_m2_append_think

Performance

On dual RTX Pro 6000, I can reach over 5500 tok/s prefill/prompt/context processing and over 100 tok/s token generation for a single request.


With PagedAttention in action you can reach over 25000 tok/s in prompt processing speed.


When batching with the default config, you can reach over 6000 or even 8000 tok/s prompt processing and 1200 tok/s generation speed.
Tune prefill vs decode prioritization with --max_num_batched_tokens; see Performance & Tuning | vLLM.


In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and 800 tok/s generation.


Note: vLLM supports prefill-decode disaggregation for high throughput serving if you have double the minimum hardware:

  • https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
  • https://github.com/vllm-project/production-stack
    • Prefill/decode disaggregation
    • Multi-Tier KV-cache via LMCache (GPU > CPU > Local Disk)
    • Cache aware router
    • Multi-model dispatch via single interface

🔬 Quantization method

Quantization was quite complex for this model and was done in 3 steps:

  1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8.
  2. llm-compressor was used to quantize the MLP experts projection using AWQ, with PR #2171 to ensure they were all activated.
  3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.

The llmcompressor library was used with the following recipe:

default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        mlp_experts_projections:
          # Include only MLP expert weights for 4-bit quantization
          targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax

      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: re:.*w3$
          balance_layers: ["re:.*w2$"]
      duo_scaling: true

The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at calibrate_software_engineer.yaml
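
As a rough illustration of what the W4A16G32 weight scheme above does numerically, here is a minimal numpy sketch of the symmetric group-wise rounding only; the actual AWQ pass additionally searches activation-aware smoothing scales per the mappings in the recipe.

import numpy as np

def fake_quant_w4a16_g32(w, group_size=32, num_bits=4):
    """Symmetric group-wise weight quantization: one scale per group of 32 weights,
    4-bit integer weights, activations left in 16-bit."""
    qmax = 2 ** (num_bits - 1) - 1                       # 7 for symmetric int4
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)          # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(w.shape)                 # dequantized ("fake quant") weights

w = np.random.randn(1024, 4096).astype(np.float32)
w_q = fake_quant_w4a16_g32(w)
print("mean absolute quantization error:", float(np.abs(w - w_q).mean()))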

Quantization theory and heuristics for manual tuning

<details> <summary>In-depth overview of quantization theory and heuristics for manual tuning</summary>

Layers to quantize

Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e., MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

This is also reported in Intel and Nvidia repo:

  • https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
  • https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950

Tensors to up-quantize

If there is enough bits, down projections should be prioritized.

According to [4]

Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. Each color represent a different projection and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model

According to [5]

Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices Wdown ā„“ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.

Mixture-of-Experts quantization (MoE)

Mixture-of-Experts require specific quantization techniques.

Mixed-precision quantization

Some layers have a higher impact on LLM performance. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers. According to [3], for 2-bit quantization:

  • quantizing expert FFN layers do not seriously impact model quality
  • quantizing cross-attention has some impact
  • quantizing self-attention has a large impact
  • quantizing dense FFN has a very significant impact

Hence to preserve model quality we should choose not to quantize dense FFN layers and self-attention layers.

We notice that:

  • official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
    • https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
  • NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
    • https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json

Layers with high-impact

According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving them to the last k blocks.

Expert quantization

When quantizing MoE, quantizing activations is tricky as only a subset of experts are activated per request. You have to make sure all experts are calibrated.

<details> <summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
  • Source: https://avtc.github.io/aquarium-side-by-side/
  • Context: https://github.com/ModelCloud/GPTQModel/pull/2235

[side-by-side comparison image]

</details>

References

  1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)
    Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
    https://arxiv.org/pdf/2506.12044

  2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
    Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
    https://arxiv.org/pdf/2406.08155v1

  3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
    Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
    https://arxiv.org/pdf/2310.02410

  4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)
    Lucas Maisonnave, Cyril Moineau, Olivier Bichler, and Fabrice Rastello
    https://arxiv.org/pdf/2504.21553

  5. Systematic Outliers in Large Language Models (2025)
    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
    https://arxiv.org/pdf/2502.06415v2

</details>

Author: mratsim

Likes: 5

Downloads: 0

Tags: llm-compressor, safetensors, minimax_m2, fp8, awq, conversational, vllm, code, devops, software engineering, engineer, developer, architect, stem, agent, text-generation, custom_code, dataset:HuggingFaceH4/ultrachat_200k, dataset:databricks/databricks-dolly-15k, dataset:neuralmagic/calibration, dataset:HuggingFaceH4/no_robots, dataset:nvidia/HelpSteer, dataset:garage-bAInd/Open-Platypus, dataset:PJMixers/grimulkan_physical-reasoning-ShareGPT, dataset:PJMixers/grimulkan_theory-of-mind-ShareGPT, dataset:HuggingFaceH4/Multilingual-Thinking, dataset:ServiceNow-AI/M2Lingual, dataset:droussis/euroblocks_sft_1sample_per_lang, dataset:interstellarninja/hermes_reasoning_tool_use, dataset:deepmind/code_contests, dataset:dh02391735/stackoverflow-kubernetes-questions, dataset:diversoailab/humaneval-rust, dataset:ammarnasr/the-stack-rust-clean, dataset:CSJianYang/CodeArena, dataset:nvidia/OpenCodeInstruct, dataset:nvidia/Llama-Nemotron-Post-Training-Dataset, dataset:nvidia/Nemotron-Competitive-Programming-v1, dataset:rombodawg/code_bagel_hermes-2.5, dataset:MathArena/project_euler, dataset:nvidia/Nemotron-Math-Proofs-v1, dataset:nvidia/OpenMathInstruct-2, dataset:nvidia/OpenScienceReasoning-2, dataset:MegaScience/MegaScience, dataset:OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B, dataset:ccdv/pubmed-summarization, dataset:gbharti/finance-alpaca, dataset:vladlen32230/summarization-yahoo-stock-finance-article-text, dataset:fka/awesome-chatgpt-prompts, dataset:theoldmandthesea/17k_business_book, dataset:ruggsea/stanford-encyclopedia-of-philosophy_instruct, dataset:mlfoundations-dev/stackexchange_philosophy, dataset:FreedomIntelligence/SocraticChat, dataset:Gryphe/Opus-WritingPrompts, dataset:anthracite-org/nopm_claude_writing_fixed, dataset:zerofata/Roleplay-Anime-Characters, dataset:zerofata/Instruct-Anime, dataset:zerofata/Instruct-Anime-CreativeWriting, dataset:sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo, dataset:PocketDoc/Dans-Prosemaxx-Adventure, dataset:anthracite-org/stheno-filtered-v1.1, dataset:KaraKaraWitch/TvTroper-2025, dataset:AquaV/US-Army-Survival-Sharegpt, dataset:AquaV/Interrogation-Sharegpt, dataset:AquaV/Multi-Environment-Operations-Sharegpt, dataset:AquaV/Resistance-Sharegpt, dataset:PocketDoc/Dans-Kinomaxx-VanillaBackrooms, arxiv:2506.12044, arxiv:2406.08155, arxiv:2310.02410, arxiv:2504.21553, arxiv:2502.06415, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:other, compressed-tensors, region:us

Arunk25/FireRed-Image-Edit-1.0_comfy_GGUF


base_model:

  • cocorang/FireRed-Image-Edit-1.0-FP8_And_BF16

For testing.

Author: Arunk25

Likes: 2

Downloads: 0

Tags: gguf, base_model:cocorang/FireRed-Image-Edit-1.0-FP8_And_BF16, base_model:quantized:cocorang/FireRed-Image-Edit-1.0-FP8_And_BF16, region:us

artificialguybr/PIXELART-REDMOND-QWENIMAGE


tags:

  • text-to-image
  • lora
  • diffusers
  • template:diffusion-lora

widget:
  • output: url: images/pixarfk64_qwen_lineart_020.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_018.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_015.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_016.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_012.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_014.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_013.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_011.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_010.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_009.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_008.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_007.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_005.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_006.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_004.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_003.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_002.png text: '-'
  • output: url: images/pixarfk64_qwen_lineart_001.png text: '-'
  • output: url: images/1771227928326__000000000_2.jpg text: '-'
  • output: url: images/1771227882184__000000000_1.jpg text: '-'
  • output: url: images/1771227836023__000000000_0.jpg text: '-'

base_model: Qwen/Qwen-Image-2512
instance_prompt: Pixel Art, PixArFK
license: apache-2.0

PIXEL ART REDMOND FOR QWEN IMAGE

Model description

Pixart - Pixel Art LoRA

Acknowledgment

I'm grateful for the GPU time from Redmond.AI that allowed me to make this model!

Description

This LoRA specializes in generating authentic pixel art style images with a nostalgic retro gaming aesthetic. Perfect for creating 8-bit and 16-bit inspired artwork, game assets, and vintage-style illustrations.

The model has a high capacity to generate detailed pixel art scenes including fantasy characters, retro game environments, cute pixel art animals, sci-fi spaceships, and magical landscapes. It works great for indie game developers, retro gaming enthusiasts, and anyone looking to create charming pixel art style images.

Trigger words

Use `Pixel Art, PixArFK` to activate the LoRA style.

Usage

I recommend using this LoRA with ComfyUI for the best results.

Load the LoRA into your ComfyUI workflow.
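
If you prefer a scripted workflow over ComfyUI, the snippet below is a minimal sketch of how this LoRA could be loaded with diffusers; the generic DiffusionPipeline loader, the bfloat16 dtype, and the example prompt are assumptions, not instructions from this card.

import torch
from diffusers import DiffusionPipeline

# Hedged sketch: assumes the base checkpoint loads via the generic
# DiffusionPipeline entry point and that this LoRA is a standard adapter.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("artificialguybr/PIXELART-REDMOND-QWENIMAGE")

# Include the trigger words to activate the style.
prompt = "Pixel Art, PixArFK, a knight exploring a retro dungeon at night"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("pixel_art_knight.png")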

License

This model is licensed under the Apache License 2.0.


Support & Links

Support My Work

If you find this model useful, please consider supporting my work.

Download model

Download them in the Files & versions tab.

Author: artificialguybr

Likes: 2

Downloads: 0

Tags: diffusers, text-to-image, lora, template:diffusion-lora, base_model:Qwen/Qwen-Image-2512, base_model:adapter:Qwen/Qwen-Image-2512, license:apache-2.0, region:us

artificialguybr/PIXELART-REDMOND-FLUXKLEIN9B


tags:

  • text-to-image
  • lora
  • diffusers
  • template:diffusion-lora

widget:
  • output: url: images/pixarfk64_flux_klein_9b_020.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_018.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_017.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_016.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_015.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_013.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_012.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_011.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_010.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_008.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_007.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_005.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_006.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_003.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_002.png text: '-'
  • output: url: images/pixarfk64_flux_klein_9b_001.png text: '-'

base_model: black-forest-labs/FLUX.2-klein-9B
instance_prompt: Pixel Art, PixArFK
license: apache-2.0

PIXEL ART REDMOND FOR FLUX KLEIN 9B

Model description

Pixart - Pixel Art LoRA

Acknowledgment

I'm grateful for the GPU time from Redmond.AI that allowed me to make this model!

Description

This LoRA specializes in generating authentic pixel art style images with a nostalgic retro gaming aesthetic. Perfect for creating 8-bit and 16-bit inspired artwork, game assets, and vintage-style illustrations.

The model has a high capacity to generate detailed pixel art scenes including fantasy characters, retro game environments, cute pixel art animals, sci-fi spaceships, and magical landscapes. It works great for indie game developers, retro gaming enthusiasts, and anyone looking to create charming pixel art style images.

Trigger words

Use `Pixel Art, PixArFK` to activate the LoRA style.

Usage

I recommend using this LoRA with ComfyUI for the best results.

Load the LoRA into your ComfyUI workflow.
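
For a scripted alternative to ComfyUI, the same hedged diffusers pattern shown for the Qwen-Image variant above should apply; only the base checkpoint and LoRA repository change, and the loader choice and prompt remain assumptions rather than part of this card.

import torch
from diffusers import DiffusionPipeline

# Hedged sketch: assumes the FLUX.2 klein checkpoint loads via the generic
# DiffusionPipeline entry point and accepts standard LoRA adapters.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("artificialguybr/PIXELART-REDMOND-FLUXKLEIN9B")
image = pipe("Pixel Art, PixArFK, a cozy retro village at dusk").images[0]
image.save("pixel_art_village.png")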

License

This model is licensed under the Apache License 2.0.


Support & Links

Support My Work

If you find this model useful, please consider supporting my work.

Download model

Download them in the Files & versions tab.

Author: artificialguybr

Likes: 2

Downloads: 0

Tags: diffusers, text-to-image, lora, template:diffusion-lora, base_model:black-forest-labs/FLUX.2-klein-9B, base_model:adapter:black-forest-labs/FLUX.2-klein-9B, license:apache-2.0, region:us

tajshuvo/Bangla-Mistral-7B-Instruct-v0.2


license: mit
datasets:

  • md-nishat-008/Bangla-Instruct

language:

  • bn

base_model:

  • mistralai/Mistral-7B-Instruct-v0.2

pipeline_tag: text-generation
library_name: transformers

Author: tajshuvo

Likes: 1

Downloads: 0

Tags: transformers, safetensors, mistral, text-generation, conversational, bn, dataset:md-nishat-008/Bangla-Instruct, base_model:mistralai/Mistral-7B-Instruct-v0.2, base_model:finetune:mistralai/Mistral-7B-Instruct-v0.2, license:mit, text-generation-inference, endpoints_compatible, region:us

Limbicnation/qwen3-4b-deforum-prompt-lora-v4


base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
model_name: qwen3-4b-deforum-prompt-lora-v4
tags:

  • generated_from_trainer
  • trl
  • sft

licence: license

Model Card for qwen3-4b-deforum-prompt-lora-v4

This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507. It has been trained using TRL.

Quick start

from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="Limbicnation/qwen3-4b-deforum-prompt-lora-v4", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Training procedure

This model was trained with SFT.

Framework versions

  • TRL: 0.27.1
  • Transformers: 4.57.6
  • Pytorch: 2.6.0+cu124
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Citations

Cite TRL as:

@misc{vonwerra2022trl,
	title        = {{TRL: Transformer Reinforcement Learning}},
	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
	year         = 2020,
	journal      = {GitHub repository},
	publisher    = {GitHub},
	howpublished = {\url{https://github.com/huggingface/trl}}
}

Author: Limbicnation

Likes: 1

Downloads: 0

Tags: transformers, safetensors, generated_from_trainer, trl, sft, base_model:Qwen/Qwen3-4B-Instruct-2507, base_model:finetune:Qwen/Qwen3-4B-Instruct-2507, endpoints_compatible, region:us

NoesisLab/Spartacus-1B-Instruct


library_name: transformers
license: apache-2.0
language:

  • en

tags:

  • monoid
  • causal-lm
  • linear-attention
  • state-space
  • O(1)-inference
  • reasoning

pipeline_tag: text-generation
model-index:

  • name: Spartacus-1B-Instruct results: []

Spartacus-1B-Instruct — Causal Monoid Language Model

A 1.3B parameter language model that replaces softmax attention with causal monoid state compression, achieving O(1) time per token and O(1) memory at inference — regardless of sequence length.

Fine-tuned for enhanced reasoning with structured chain-of-thought data.

Monoid Attention — Internal Structure

                          MonoidAttention (per layer, per head)
 ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
 │                                                                     │
 │   x_t ∈ R^{2048}                                                   │
 │    │                                                                │
 │    ā”œā”€ā”€> q_proj ──> RMSNorm ──> q_t ∈ R^{d}     (query)            │
 │    │                                                                │
 │    ā”œā”€ā”€> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d}  (key, >= 0) │
 │    │                                                                │
 │    ā”œā”€ā”€> v_proj ──> v_t ∈ R^{d}                  (value)            │
 │    │                                                                │
 │    └──> decay_proj ──> sigmoid ──> alpha_t ∈ (0,1)  (decay gate)   │
 │                                                                     │
 │         k_t (x) v_t                                                 │
 │            │            ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”            │
 │            │            │  State Matrix S_t ∈ R^{d x d} │            │
 │            v            │                              │            │
 │    S_t = alpha_t * S_{t-1} + k_t (x) v_t              │            │
 │            │            │  "Compressed causal history" │            │
 │            │            ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜            │
 │            v                                                        │
 │    o_t = q_t . S_t ──> o_proj ──> output                           │
 │                                                                     │
 ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Monoid State Diagonal — O(1) Compression Contour

The state matrix S_t accumulates causal history along its diagonal. Each head maintains an independent d x d state that compresses ALL past tokens into a fixed footprint:

   State Matrix S_t ∈ R^{64 x 64}  (one per head, 32 heads per layer)

   k-dim -->
   0   8  16  24  32  40  48  56  63
   ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”  0
   │***│** │*  │   │   │   │   │   │     v-dim
   │***│** │*  │.  │   │   │   │   │      |
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤  8   |
   │** │***│** │*  │.  │   │   │   │      v
   │*  │***│** │*  │.  │   │   │   │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 16
   │*  │** │***│** │*  │.  │   │   │
   │.  │*  │***│** │*  │.  │   │   │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 24
   │   │.  │** │***│** │*  │.  │   │
   │   │   │*  │***│** │*  │.  │   │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 32
   │   │   │.  │** │***│** │*  │.  │
   │   │   │   │*  │***│** │*  │.  │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 40
   │   │   │   │.  │** │***│** │*  │
   │   │   │   │   │*  │***│** │*  │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 48
   │   │   │   │   │.  │** │***│** │
   │   │   │   │   │   │*  │***│** │
   ā”œā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”¤ 56
   │   │   │   │   │   │.  │** │***│
   │   │   │   │   │   │   │*  │***│
   ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜ 63

   Legend:  *** = high activation (recent tokens, alpha^0 ~ alpha^2)
            **  = medium (alpha^3 ~ alpha^5)
            *   = fading (alpha^6 ~ alpha^10)
            .   = near-zero (alpha^11+, effectively forgotten)
                = zero (never reached or fully decayed)

   The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
   Recent outer products dominate near the diagonal; older ones decay
   exponentially via alpha, creating this characteristic contour.

Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | O(1) -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | O(1) -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | Unlimited -- state size is constant |
| Causality | Imposed via attention mask | Built into the recurrence |
| Training complexity | O(T^2) | O(T) via parallel prefix scan |

The Monoid Recurrence

Standard attention computes:

o_t = sum_{i<=t} softmax(q_t . k_i) v_i    -- requires O(T) KV-cache

Monoid attention compresses the entire causal history into a fixed-size state matrix S_t per head:

S_t = alpha_t * S_{t-1} + k_t (x) v_t      -- explicit causal recurrence
o_t = q_t . S_t                              -- state readout

where alpha_t = sigmoid(decay_proj(x_t)) is a learned, content-dependent decay gate that controls how fast past information fades.
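
To make the update concrete, here is a minimal PyTorch sketch of a single-head decoding step of this recurrence; the monoid_step helper, the random stand-in tensors, and the omission of the projection layers are illustrative assumptions, not the repository's MonoidAttention API.

import torch
import torch.nn.functional as F

def monoid_step(S_prev, q_t, k_t, v_t, alpha_t):
    # S_prev: (d, d) running state; q_t, k_t, v_t: (d,); alpha_t: scalar in (0, 1).
    S_t = alpha_t * S_prev + torch.outer(k_t, v_t)   # S_t = alpha_t * S_{t-1} + k_t (x) v_t
    o_t = q_t @ S_t                                  # o_t = q_t . S_t  (O(d^2) readout)
    return S_t, o_t

d = 64
S = torch.zeros(d, d)                        # the learnable h0 would initialize this in the real model
for _ in range(8):                           # constant work and memory per generated token
    q, v = torch.randn(d), torch.randn(d)
    k = F.silu(torch.randn(d))               # SiLU-activated key, as in the diagram above
    alpha = torch.sigmoid(torch.randn(()))   # content-dependent decay gate
    S, o = monoid_step(S, q, k, v, alpha)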

Explicit Causal Modeling

Unlike Transformers where causality is a constraint imposed by masking, Spartacus makes causality a first-class citizen:

  • The decay gate alpha_t explicitly controls per-head information retention at every timestep
  • The model learns when to forget rather than encoding where tokens are (no positional encoding needed)
  • No attention mask required -- causality is structural, not enforced

Design Choices

  • SiLU-activated keys: k = SiLU(k_proj(x)) ensures non-negative keys, making the state matrix S positive semi-definite (PSD). This prevents "feature erasure" where one token's contribution cancels another's
  • Log-space decay: Working in log-space log(alpha) avoids numerical underflow when alpha^T -> 0 for long sequences
  • Learnable h0: The initial state S_0 = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"

Model Details

| Parameter | Value |
|---|---|
| Model | NoesisLab/Spartacus-1B-Instruct |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
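
As a quick back-of-the-envelope check of the constant inference footprint implied by the table above; the KV-cache comparison assumes a plain multi-head Llama-style cache with the same head geometry, which is an assumption for illustration rather than a measured number.

# Rough memory arithmetic from the table above; bf16 = 2 bytes per value.
heads, head_dim, layers, bytes_per = 32, 64, 16, 2

state_per_layer = heads * head_dim * head_dim * bytes_per            # fixed, independent of T
print(state_per_layer / 1024, "KiB per layer,",
      layers * state_per_layer / 2**20, "MiB total")                 # 256.0 KiB, 4.0 MiB

def kv_cache_per_layer(T):
    # A plain multi-head KV-cache stores K and V for all T past tokens.
    return 2 * heads * head_dim * T * bytes_per

for T in (2_048, 32_768, 262_144):
    print(T, kv_cache_per_layer(T) / 2**20, "MiB per layer")         # 16, 256, 2048 MiB: grows with T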

Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |

Comparison with ~1B Baselines (acc_norm, 0-shot)

| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | 0.3063 | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | 0.5518 | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | 0.4610 | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | 0.6915 | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | 0.5225 | 0.5040 | ~0.592 | 0.510 | ~0.515 |

Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining O(1) inference time and memory per token. Scores marked with ~ are approximate community-reported values.

Training

Stage 1: General SFT

  • Base weights: Transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
  • Data: Capybara + smol-smoltalk (general conversation)
  • Training: Full-parameter SFT

Stage 2: Reasoning Enhancement

  • Data mix: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
  • Steps: 2,000
  • Learning rate: 2e-5 (cosine schedule, 50 warmup steps)
  • Batch size: 8
  • Sequence length: 2,048
  • Precision: bfloat16
  • Optimizer: AdamW (weight decay 0.01, max grad norm 1.0)

The reasoning data uses structured "Thought + Solution" format to strengthen chain-of-thought capabilities while the general data prevents catastrophic forgetting.

Parallel Scan Implementation

The monoid_scan_cuda.py module provides a Triton JIT-compiled parallel prefix scan:

  • Forward: Sequential scan along T, parallelized across B x H x D on GPU via Triton kernels
  • Backward: Reverse-order adjoint scan computes gradients for both values and log-decay gates
  • Fallback: Pure PyTorch sequential scan for CPU/MPS (a sketch follows this list)
  • Auto-dispatch: CUDA -> Triton kernel, otherwise -> PyTorch fallback
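
Below is a minimal sketch of what such a pure-PyTorch sequential fallback could look like, operating on precomputed outer products and log-space decays; the function name monoid_scan_reference and the tensor layout are assumptions, not the actual monoid_scan_cuda.py implementation.

import torch

def monoid_scan_reference(kv, log_alpha, S0):
    # kv: (B, H, T, d, d) outer products k_t (x) v_t
    # log_alpha: (B, H, T) log-decay gates (<= 0); S0: (B, H, d, d) initial state
    # Returns the stacked states S_1..S_T with shape (B, H, T, d, d).
    B, H, T, d, _ = kv.shape
    S, states = S0, []
    for t in range(T):                                    # sequential along T,
        a = log_alpha[:, :, t].exp()[..., None, None]     # parallel over B x H x d x d
        S = a * S + kv[:, :, t]                           # S_t = alpha_t * S_{t-1} + k_t (x) v_t
        states.append(S)
    return torch.stack(states, dim=2)

# Tiny smoke test with random inputs.
B, H, T, d = 1, 2, 4, 8
out = monoid_scan_reference(torch.randn(B, H, T, d, d),
                            -torch.rand(B, H, T),
                            torch.zeros(B, H, d, d))
print(out.shape)   # torch.Size([1, 2, 4, 8, 8])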

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

File Structure

MonoidForCausalLM.py       # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py        # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors          # Model weights (bfloat16)
config.json                # Model configuration
tokenizer.json             # Llama-3.2 tokenizer

Citation

@software{spartacus2025,
  title={Spartacus: Causal Monoid Language Model with O(1) Inference},
  author={NoesisLab},
  year={2025},
  url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description={Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}

License

Apache 2.0

Author: NoesisLab

Likes: 1

Downloads: 0

Tags: transformers, safetensors, monoid, text-generation, causal-lm, linear-attention, state-space, O(1)-inference, reasoning, conversational, custom_code, en, license:apache-2.0, region:us