Today's AI Summary

AI Developments: Quantization, Long-Context LLMs, and VLA Advancements

Here's a look at notable AI developments from today, covering quantization techniques, long-context language models, and vision-language-action (VLA) models.

Research Papers

Several research papers have been published that introduce new techniques and benchmarks:

  • ButterflyQuant: A paper introduces a novel quantization method called "ButterflyQuant" for large language models (LLMs). It uses learnable butterfly transforms to achieve ultra-low-bit quantization, addressing the performance loss typically associated with extreme quantization. The method demonstrates significant improvements in perplexity compared to existing rotation-based methods like QuaRot.
  • The Illusion of Diminishing Returns: This paper challenges the notion of diminishing returns in scaling LLMs. It highlights that even small improvements in single-step accuracy can lead to significant gains in long-horizon task completion. The paper also identifies a "self-conditioning effect" where models become more prone to errors when their previous mistakes are included in the context.
  • SimpleVLA-RL: This paper introduces an efficient reinforcement learning framework, SimpleVLA-RL, for vision-language-action (VLA) models. It addresses the challenges of data scarcity and limited generalization in VLA models by using RL to improve long-horizon action planning. The framework achieves state-of-the-art performance on LIBERO and outperforms supervised fine-tuning in real-world tasks.
  • LoCoBench: A new benchmark, LoCoBench, is introduced for evaluating long-context LLMs in complex software engineering scenarios. It includes 8,000 evaluation scenarios across 10 programming languages, with context lengths ranging from 10K to 1M tokens. The benchmark aims to assess the ability of LLMs to understand entire codebases, reason across multiple files, and maintain architectural consistency in large-scale software systems.
  • Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations: This paper introduces a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and evaluation set for question answering in the domain of radio regulations. The approach improves generation accuracy across all tested models, demonstrating the effectiveness of targeted grounding for regulatory question answering.
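ButterflyQuant's butterfly transforms are products of log2(n) sparse orthogonal factors, each built from 2×2 Givens rotations, giving an orthogonal matrix with only O(n log n) learnable parameters. The minimal numpy sketch below illustrates that structure; it is not the paper's implementation, and the function names are illustrative.

```python
import numpy as np

def butterfly_factor(n, stride, angles):
    """One butterfly factor: 2x2 Givens rotations pairing index i with i+stride."""
    B = np.eye(n)
    for i in range(n):
        if (i // stride) % 2 == 0:       # i is the "upper" element of its pair
            j = i + stride
            c, s = np.cos(angles[i]), np.sin(angles[i])
            B[i, i], B[i, j] = c, -s
            B[j, i], B[j, j] = s, c
    return B

def butterfly_transform(n, rng):
    """Product of log2(n) sparse factors; orthogonal with O(n log n) parameters."""
    M = np.eye(n)
    stride = 1
    while stride < n:
        M = butterfly_factor(n, stride, rng.uniform(0, 2 * np.pi, n)) @ M
        stride *= 2
    return M

Q = butterfly_transform(8, np.random.default_rng(0))
assert np.allclose(Q @ Q.T, np.eye(8), atol=1e-9)  # orthogonal by construction
```

Because each factor is orthogonal by construction, the product stays orthogonal for any angle setting, which is what lets the rotation be learned freely without a constraint term.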

Models

Several new models have been released, focusing on various applications:

  • wikeeyang/SRPO-Refine-Quantized-v1.0: This model refines and quantizes the Tencent/SRPO model, improving the clarity of generated images and model compatibility. It is based on the paper "Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference" (arxiv:2509.06942). The model has received 10 likes.
  • Qwen3-Next-80B-A3B-Instruct-bnb-4bit: This model is a 4-bit quantized version of Qwen3-Next-80B-A3B-Instruct, featuring hybrid attention, high-sparsity mixture-of-experts, and multi-token prediction. It supports ultra-long context lengths up to 256K tokens and performs competitively on various benchmarks.
  • QuantFactory/UIGEN-X-8B-GGUF: This is a quantized version of Tesslate/UIGEN-X-8B, a hybrid reasoning UI generation model built on the Qwen3-8B architecture. It is trained to systematically plan, architect, and implement complete user interfaces across modern development stacks.

Key Takeaways

  • Quantization Advances: ButterflyQuant demonstrates a promising approach to reducing the memory footprint of LLMs through learnable orthogonal transforms, potentially enabling deployment on consumer hardware.
  • Long-Context LLM Evaluation: LoCoBench provides a valuable benchmark for assessing the capabilities of long-context LLMs in complex software engineering tasks, highlighting the challenges and opportunities in this area.
  • VLA Model Improvement: SimpleVLA-RL offers an effective RL framework for enhancing VLA models, addressing data scarcity and generalization issues in robotic manipulation.
  • Qwen3 Models: The Qwen3 family of models continues to be a popular choice for various applications, with new models like Qwen3-Next-80B-A3B-Instruct pushing the boundaries of context length and performance.
  • UI Generation Models: Models like UIGEN-X-8B and UIGEN-FX-4B-Preview are making strides in AI-driven UI development, offering the potential to automate and accelerate the creation of user interfaces.

AI Papers for 2026-03-28

Vega: Learning to Drive with Natural Language Instructions

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning. It employs the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). Joint attention enables interactions between the modalities, while individual projection layers per modality increase model capacity. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.
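The write-back idea can be sketched in a few lines: use a labeled example to find which retrieved documents carry the answer, distill them into a compact knowledge unit, and index that unit alongside the corpus. In this illustrative stand-in, "distillation" is just keeping answer-bearing sentences; the paper uses a learned pipeline for that step, and `write_back` is a hypothetical helper, not the authors' code.

```python
def write_back(corpus, question, answer, retrieved_ids):
    """Distill answer-bearing retrieved docs into a compact unit and index it.

    'Distillation' here is a naive stand-in: keep sentences mentioning the
    gold answer. WriteBack-RAG does this with a learned component.
    """
    evidence = []
    for doc_id in retrieved_ids:
        for sent in corpus[doc_id].split(". "):
            if answer.lower() in sent.lower():
                evidence.append(sent.strip().rstrip("."))
    if evidence:
        # The unit is added to the corpus, so any downstream RAG pipeline
        # can retrieve it -- the improvement "resides in the corpus itself".
        corpus[f"unit:{question}"] = ". ".join(evidence) + "."
    return corpus

corpus = {
    "d1": "Paris hosted the 1900 games. The Eiffel Tower is in Paris",
    "d2": "Unrelated filler about sports history",
}
write_back(corpus, "Where is the Eiffel Tower?", "Paris", ["d1", "d2"])
assert "unit:Where is the Eiffel Tower?" in corpus
```

Because only the corpus changes, the step runs once offline and composes with any retriever or generator.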

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
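The three-partition cache can be sketched at the index level: early frames are kept as full-resolution sink tokens, the most recent frames stay at full resolution, and the middle of the history is compressed with top-k selection by a relevance score. This is a schematic of the bookkeeping only; the real mid tokens are 32×-compressed features, and `partition_cache` and its parameters are illustrative assumptions.

```python
def partition_cache(num_frames, n_sink=2, n_recent=4, mid_top_k=3, scores=None):
    """Split frame history into sink / mid / recent partitions (index sketch)."""
    frames = list(range(num_frames))
    sink = frames[:n_sink]                # early anchors, full resolution
    recent = frames[-n_recent:]           # local temporal coherence, full res
    mid_pool = frames[n_sink:-n_recent]   # compressed history, bounded by top-k
    if scores is None:
        scores = {f: 0.0 for f in mid_pool}
    top = sorted(mid_pool, key=lambda f: scores[f], reverse=True)[:mid_top_k]
    mid = sorted(top)                     # keep temporal order after selection
    return sink, mid, recent

# Relevance stand-in: frames nearer the recent window score higher.
sink, mid, recent = partition_cache(12, scores={f: -abs(f - 7) for f in range(2, 8)})
assert sink == [0, 1] and recent == [8, 9, 10, 11]
assert mid == [5, 6, 7]
```

The bounded mid partition is what keeps the cache size constant as the video grows; the dropped frames are the reason the positional encoding then needs re-alignment (the paper's Temporal RoPE Adjustment).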

PixelSmile: Toward Fine-Grained Facial Expression Editing

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.
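The linear-control mechanism rests on interpolating in the textual latent space: blending a neutral latent toward a target-expression latent with a scalar intensity. The sketch below shows the generic operation only; it is not PixelSmile code, and `interpolate_latents` is an illustrative name.

```python
def interpolate_latents(neutral, target, alpha):
    """Linear interpolation in a latent space: alpha=0 gives the neutral
    latent, alpha=1 the full target expression, values between blend them."""
    return [(1 - alpha) * n + alpha * t for n, t in zip(neutral, target)]

# Toy 2-D latents stand in for real text-embedding vectors.
mid = interpolate_latents([0.0, 1.0], [1.0, 3.0], 0.5)
assert mid == [0.5, 2.0]
```

Linear control then means that sweeping alpha produces a monotone change in expression intensity, which is one of the axes FFE-Bench evaluates.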

Back to Basics: Revisiting ASR in the Age of Voice Agents

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

Natural-Language Agent Harnesses

Agent performance increasingly depends on *harness engineering*, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce **Natural-Language Agent Harnesses** (NLAHs), which express harness behavior in editable natural language, and the **Intelligent Harness Runtime** (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
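The cycle-consistency reward can be illustrated with a trivial stand-in: the reward is 1 when the cross-modal round trip (backward inference, modality switch, forward inference) reproduces the original answer, and 0 otherwise. `cycle_reward` below is an assumption for exposition, not the paper's reward function, which operates on actual model outputs.

```python
def cycle_reward(forward_answer, backward_reconstruction):
    """Label-free reward: 1.0 iff the round-trip reconstruction matches the
    forward answer (after trivial normalization). Sketch of the R-C2 idea."""
    norm = lambda s: s.strip().lower()
    return 1.0 if norm(forward_answer) == norm(backward_reconstruction) else 0.0

assert cycle_reward("a red cube", " A Red Cube ") == 1.0
assert cycle_reward("a red cube", "a blue sphere") == 0.0
```

Because the signal compares the model against itself, no ground-truth labels are needed, which is what makes the reward dense and label-free.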

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

We present an empirical study of how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
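The Stage 1 assembly step is essentially a multiple-choice knapsack: pick one (area, speedup) configuration per sub-kernel, maximizing total gain under the area budget. The brute-force sketch below stands in for the paper's ILP solver; `assemble` and the toy numbers are assumptions for illustration.

```python
from itertools import product

def assemble(configs, area_budget):
    """Pick one (area, speedup) config per sub-kernel, maximizing total
    speedup subject to the area budget. Brute-force stand-in for the ILP."""
    best, best_choice = -1.0, None
    for choice in product(*configs):          # one config per sub-kernel
        area = sum(a for a, _ in choice)
        if area <= area_budget:
            gain = sum(s for _, s in choice)
            if gain > best:
                best, best_choice = gain, choice
    return best_choice

configs = [
    [(10, 1.0), (30, 3.0)],   # sub-kernel A: baseline vs. aggressively unrolled
    [(15, 1.0), (40, 4.0)],   # sub-kernel B
]
# Budget 55 rules out taking both aggressive variants (area 70).
assert assemble(configs, area_budget=55) == ((10, 1.0), (40, 4.0))
```

A real ILP formulation scales far better than this enumeration, but the objective and constraint are the same shape.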

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

AI Models

Skywork/Matrix-Game-3.0



Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Links: [GitHub](https://github.com/SkyworkAI/Matrix-Game) · [Technical Report](https://github.com/SkyworkAI/Matrix-Game/blob/main/Matrix-Game-3/assets/pdf/report.pdf) · [Project Page](https://matrix-game-v3.github.io/)

📝 Overview

Matrix-Game-3.0 is an open-source, memory-augmented interactive world model designed for 720p real-time long-form video generation.

Framework Overview

Our framework unifies three stages into an end-to-end pipeline:

  • Data Engine — an industrial-scale infinite data engine integrating Unreal Engine synthetic scenes, large-scale automated AAA game collection, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplets at scale;
  • Model Training — a memory-augmented Diffusion Transformer (DiT) with an error buffer that learns action-conditioned generation with memory-enhanced long-horizon consistency;
  • Inference Deployment — few-step sampling, INT8 quantization, and model distillation achieving 720p@40FPS real-time generation with a 5B model.

✨ Key Features

  • 🚀 Feature 1: Upgraded Data Engine: Combines Unreal Engine-based synthetic data, large-scale automated AAA game data, and real-world video augmentation to generate high-quality Video–Pose–Action–Prompt data.
  • 🖱️ Feature 2: Long-horizon Memory & Consistency: Uses prediction residuals and frame re-injection for self-correction, while camera-aware memory ensures long-term spatiotemporal consistency.
  • 🎬 Feature 3: Real-Time Interactivity & Open Access: Employs a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder distillation, to support 40 FPS real-time generation at 720p resolution with a 5B model while maintaining stable memory consistency over minute-long sequences.
  • 👍 Feature 4: Scaled-Up 28B-MoE Model: Scaling up to a 2×14B model further improves generation quality, dynamics, and generalization.

🔥 Latest Updates

  • [2026-03] 🎉 Initial release of Matrix-Game-3.0 Model

🚀 Quick Start

Installation

Create a conda environment and install dependencies:

conda create -n matrix-game-3.0 python=3.12 -y
conda activate matrix-game-3.0
git clone https://github.com/SkyworkAI/Matrix-Game-3.0.git
cd Matrix-Game-3.0
pip install -r requirements.txt
# The project also depends on FlashAttention:
# https://github.com/Dao-AILab/flash-attention

Model Download

pip install "huggingface_hub[cli]"
huggingface-cli download Skywork/Matrix-Game-3.0 --local-dir Matrix-Game-3.0

Inference

Before running inference, you need to prepare:

  • Input image
  • Text prompt

After downloading pretrained models, you can use the following command to generate an interactive video with random actions:

torchrun --nproc_per_node=$NUM_GPUS generate.py \
    --size 704*1280 --dit_fsdp --t5_fsdp \
    --ckpt_dir Matrix-Game-3.0 --fa_version 3 --use_int8 \
    --num_iterations 12 --num_inference_steps 3 \
    --image demo_images/000/image.png \
    --prompt "a vintage gas station with a classic car parked under a canopy, set against a desert landscape." \
    --save_name test --seed 42 \
    --compile_vae --lightvae_pruning_rate 0.5 --vae_type mg_lightvae \
    --output_dir ./output
# "num_iterations" is the number of generation iterations. The total number of
# frames generated is: 57 + (num_iterations - 1) * 40

Tips: To use the base model, pass "--use_base_model --num_inference_steps 50". To generate interactive videos with your own input actions, pass "--interactive". With multiple GPUs, you can pass --use_async_vae --async_vae_warmup_iters 1 to speed up inference.
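The frame-count formula from the command's comment can be checked with a tiny helper (illustrative only, not part of the repo):

```python
def total_frames(num_iterations):
    """Frames produced by generate.py: 57 + (num_iterations - 1) * 40."""
    return 57 + (num_iterations - 1) * 40

assert total_frames(1) == 57    # a single iteration yields 57 frames
assert total_frames(12) == 497  # the default command above
```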

📖 Citation

If you find this work useful for your research, please kindly cite our paper:

  @misc{2026matrix,
    title={Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory},
    author={{Skywork AI Matrix-Game Team}},
    year={2026},
    howpublished={Technical report},
    url={https://github.com/SkyworkAI/Matrix-Game/blob/main/Matrix-Game-3/assets/pdf/report.pdf}
  }

Author: Skywork

Likes: 28

Downloads: 0

Tags: diffusers, safetensors, image-text-to-video, en, base_model:Wan-AI/Wan2.2-TI2V-5B, base_model:finetune:Wan-AI/Wan2.2-TI2V-5B, license:apache-2.0, region:us

0xSero/Qwen-3.5-28B-A3B-REAP



Qwen3.5-35B-A3B — REAP 20% Expert Pruning

This is a 20% expert-pruned version of Qwen/Qwen3.5-35B-A3B, produced using the REAP (Router-weighted Expert Activation Pruning) method from the paper "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression" (ICLR 2026).

The model retains 205 of 256 experts (~80% of original experts) while remaining competitive on standard benchmarks.


What We Did

Method: REAP Layerwise Pruning

REAP prunes MoE experts by scoring each expert's importance using a combination of:

  1. Router gate-values — how often and how strongly the router selects each expert
  2. Expert activation norms — the magnitude of each expert's output contribution

Router logit weights are renormalized to sum to 1 after pruning (critical for maintaining output scale). Pruning is applied layer-by-layer (layerwise mode).
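A simplified sketch of that scoring rule follows; `reap_prune` and its shapes are illustrative assumptions, not the REAP codebase. Each expert's saliency is its router gate value times its output-activation norm, averaged over calibration tokens, and the surviving routers are renormalized to sum to 1 per token.

```python
import numpy as np

def reap_prune(gate_values, expert_outputs, keep):
    """Score experts by router-weighted activation norm; keep the top `keep`.

    gate_values: (tokens, experts) router weights.
    expert_outputs: (tokens, experts, dim) per-expert output vectors.
    """
    saliency = (gate_values * np.linalg.norm(expert_outputs, axis=-1)).mean(axis=0)
    kept = np.sort(np.argsort(saliency)[-keep:])      # indices of survivors
    # Renormalize router weights over surviving experts (sum to 1 per token),
    # mirroring the renormalization step described above.
    g = gate_values[:, kept]
    g = g / g.sum(axis=1, keepdims=True)
    return kept, g

rng = np.random.default_rng(0)
gates = rng.random((16, 8)); gates /= gates.sum(axis=1, keepdims=True)
outs = rng.normal(size=(16, 8, 4))
kept, g = reap_prune(gates, outs, keep=6)
assert len(kept) == 6 and np.allclose(g.sum(axis=1), 1.0)
```

Applied layer-by-layer with a 20% ratio, this kind of scoring is what removes 51 of the 256 experts here.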

Calibration Data

Observations were collected over a mixed calibration dataset of 1,000 samples (250 per category):

  • theblackcat102/evol-codealpaca-v1 (250 samples)
  • open-r1/Mixture-of-Thoughts[code] (250 samples)
  • open-r1/Mixture-of-Thoughts[math] (250 samples)
  • open-r1/Mixture-of-Thoughts[science] (250 samples)

Max sequence length: 4096 tokens; angular distance was used as the expert-similarity measure.

Pruning Config

| Parameter | Value |
|---|---|
| Compression ratio | 20% (51 experts removed) |
| Original experts | 256 |
| Remaining experts | 205 |
| Pruning method | REAP |
| Router weight renormalization | ✓ |
| Seed | 42 |
| Calibration samples | 1,000 total |


Benchmark Results

All evaluations run with vLLM (tensor-parallel across 8x RTX 3090), greedy decoding, 0-shot.

Coding (EvalPlus, greedy, 0-shot)

| Benchmark | Original | Pruned (20%) | Delta |
|---|---|---|---|
| HumanEval (pass@1) | 76.2% | 73.2% | -3.0% |
| HumanEval+ (pass@1) | 72.0% | 70.1% | -1.9% |

Multiple Choice / Reasoning (lm-eval, 0-shot, 250 samples/task)

| Benchmark | Original | Pruned (20%) | Delta |
|---|---|---|---|
| MMLU | 84.34% | 80.89% | -3.45% |
| MMLU - Humanities | 82.40% | 76.35% | -6.05% |
| MMLU - Social Sciences | 90.04% | 88.38% | -1.66% |
| MMLU - STEM | 81.46% | 78.88% | -2.58% |
| MMLU - Other | 84.52% | 81.05% | -3.47% |
| ARC-Challenge | 60.00% | 60.40% | +0.40% |
| ARC-Easy | 84.00% | 83.20% | -0.80% |
| BoolQ | 88.00% | 89.20% | +1.20% |
| HellaSwag (norm) | 76.40% | 75.60% | -0.80% |
| OpenBookQA (norm) | 45.20% | 47.20% | +2.00% |
| RTE | 81.20% | 82.00% | +0.80% |
| WinoGrande | 77.20% | 76.80% | -0.40% |

Perplexity (WikiText-2, 10k tokens, llama.cpp)

| Model | PPL |
|---|---|
| Original (256 experts) | 6.83 |
| Pruned 20% (205 experts) | 9.51 |
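For reference, perplexity is the exponentiated mean per-token negative log-likelihood, so these numbers compare average per-token uncertainty directly. A minimal sketch of the metric itself:

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/6.83 has PPL 6.83.
assert abs(perplexity([math.log(6.83)] * 4) - 6.83) < 1e-9
```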

Throughput (4x RTX 3090, TP=4, vLLM, enforce_eager)

| Batch Size | Original tok/s | Pruned tok/s | Speedup |
|---|---|---|---|
| 1 | 12.3 | 12.5 | 1.02x |
| 4 | 37.0 | 36.0 | 0.97x |
| 8 | 74.4 | 70.3 | 0.95x |
| 16 | 89.3 | 86.0 | 0.96x |

Note: throughput speedup is minimal at this compression level with current vLLM routing overhead. The primary benefit is reduced VRAM footprint.


Memory Footprint

| Model | Size | Shards |
|---|---|---|
| Original | ~71 GB (bf16) | 14 safetensors |
| Pruned 20% | ~53 GB (bf16) | 2 safetensors |
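As a back-of-envelope check, bf16 stores 2 bytes per parameter, so the sizes above imply roughly 35.5B and 26.5B stored parameters; these counts are arithmetic inferences from the table, not figures from the card.

```python
def bf16_gb(n_params):
    """bf16 weight storage: 2 bytes per parameter, reported in GB (1e9 bytes)."""
    return n_params * 2 / 1e9

# Rough sanity check against the table above (sizes are approximate).
assert round(bf16_gb(35.5e9)) == 71
assert round(bf16_gb(26.5e9)) == 53
```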


Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3.5-35B-A3B-REAP-20pct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

With vLLM:

vllm serve 0xSero/Qwen3.5-35B-A3B-REAP-20pct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768

Reproducing

git clone https://github.com/cerebras/reap
cd reap
bash scripts/build.sh

python -m reap.layerwise_prune \
    --model_name Qwen/Qwen3.5-35B-A3B \
    --dataset_name "theblackcat102/evol-codealpaca-v1:250,open-r1/Mixture-of-Thoughts[code]:250,open-r1/Mixture-of-Thoughts[math]:250,open-r1/Mixture-of-Thoughts[science]:250" \
    --compression_ratio 0.20 \
    --prune_method reap \
    --seed 42 \
    --renormalize_router_weights true

Citation

@inproceedings{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and others},
  booktitle={ICLR 2026},
  year={2026}
}

Author: 0xSero

Likes: 22

Downloads: 0

Tags: safetensors, qwen3_5_moe, moe, pruning, reap, qwen3, expert-pruning, en, arxiv:2510.13999, base_model:Qwen/Qwen3.5-35B-A3B, base_model:finetune:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, model-index, region:us

Comfy-Org/ltx-2.3

Author: Comfy-Org

Likes: 8

Downloads: 0

Tags: region:us

alvdansen/illustration-1.0-qwen-image



Illustration 1.0 -- Qwen-Image

A style LoRA for Qwen-Image trained on 244 curated illustration and anime reference images across diverse visual styles. Produces images spanning cute anime character design, European bande dessinee, indie risograph prints, storybook watercolor, retro shoujo manga, and graphic novel illustration.

Style Influences

  • Japanese anime and manga illustration -- clean cel shading, expressive character design, shoujo and slice-of-life aesthetics
  • European graphic novel and bande dessinee -- ink crosshatching, atmospheric landscapes, ligne claire
  • Indie illustration and zine culture -- risograph print textures, limited palettes, grainy halftone
  • Children's book and storybook illustration -- watercolor washes, warm palettes, charming character proportions
  • Retro anime aesthetics -- 80s/90s anime film stills, VHS grain, bold color blocking
  • Concept art and character design -- flat color fills, turnaround sheets, fashion illustration

Usage

No trigger word -- this is a style LoRA. Describe what you want and add style cues like "ink and watercolor", "clean cel shading", "risograph print" to steer the output.

Recommended Inference Settings

Sampler: euler
Scheduler: simple
CFG: 3.5
Steps: 45 (30-60 works well)
LoRA strength: 0.8-1.0

Sample Generations


Prompts used for the sample generations:

  • close-up, a girl with short messy silver hair and round glasses, wearing a chunky knit turtleneck sweater, one hand tucking hair behind her ear, looking slightly past the viewer with half-closed eyes, anime style
  • a girl with short hair in a bomber jacket leaning against a wall, clean cel shading, bold graphic composition, 90s ranma era anime, film grain
  • a wanderer approaching a stone gate in the desert, european graphic novel, detailed ink hatching, warm sand tones, moebius style
  • a fox sleeping in a hollowed-out log, children's book watercolor, soft wet-on-wet washes, autumn leaf palette
  • portrait, a girl with star-shaped hair clips, bold graphic shapes, limited three-color palette, screen print flatness, harajuku fashion illustration
  • a cat wearing a tiny cape perched on a fence post, indie risograph print, two-color teal and coral, grainy paper texture
  • an astronaut sitting on a rocky surface with a small robot, retro watercolor, warm olive and cream tones, hand-painted feel
  • wide shot, a girl on a bicycle coasting downhill, ghibli film still, clean cel shading, golden hour warmth, anime style lofi
  • wide shot, a lone figure on a cliff overlooking the sea with seagulls, bande dessinee, fine ink hatching, muted blue-gray
  • portrait, a girl with flowers growing from her hair, risograph print, three-color pink blue and cream, grainy texture
  • a boy fixing a radio on a cluttered workbench, warm tungsten light, retro gouache illustration, ochre and burnt sienna palette
  • close-up, a girl with braids and paint-stained fingers holding a sketchbook to her chest, soft cel shading, 90s anime, faded VHS warmth
  • wide shot, a girl asleep in a hammock strung between two bookshelves, cozy interior light, retro watercolor, cream and dusty gold tones, hand-painted feel

Training Details

  • Base model: Qwen-Image (FP8 quantized, text encoder FP8)
  • Training steps: 59,000
  • Rank/Alpha: 42/42
  • Learning rate: 5e-5
  • Optimizer: AdamW 8-bit
  • Caption dropout: 0.35
  • EMA: enabled (decay 0.99)
  • Dataset: 244 curated images across 4 subsets, trained sequentially then consolidated on the full dataset
  • Trainer: ai-toolkit by Ostris

Author: alvdansen

Likes: 5

Downloads: 0

Tags: peft, qwen-image, lora, illustration, anime, style, bande-dessinee, graphic-novel, risograph, text-to-image, en, base_model:Qwen/Qwen-Image, base_model:adapter:Qwen/Qwen-Image, license:other, region:us

karakuri-ai/karakuri-vl-2-8b-thinking-2603


license: apache-2.0
language:
  • ja
  • en
base_model:
  • Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
library_name: transformers

KARAKURI VL 2 8B Thinking 2603

Model Details

Model Description

  • Developed by: KARAKURI Inc.
  • Model type: Vision-Language Model
  • Languages: Japanese and English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen3-VL-8B-Thinking
  • Contact: For questions and comments about the model, please email karakuri-rd@karakuri.ai

Usage

Use in 🤗 Transformers

First, install the required dependencies:

pip install transformers accelerate qwen-vl-utils[decord]==0.0.8

Then, use the following code to load the model and generate responses:

from transformers import AutoModelForImageTextToText, AutoProcessor

model_name = "karakuri-ai/karakuri-vl-2-8b-thinking-2603"
model = AutoModelForImageTextToText.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Training Details

Training Infrastructure

  • Hardware: The model was trained on Amazon EC2 trn2.48xlarge instances.
  • Software: We used code based on neuronx-distributed.

Acknowledgments

This work was supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO) through the Generative AI Accelerator Challenge (GENIAC).

Citation

@misc{karakuri_vl_2_8b_thinking_2603,
    author       = { {KARAKURI} {Inc.} },
    title        = { {KARAKURI} {VL} 2 8{B} {Thinking} 2603 },
    year         = { 2026 },
    url          = { https://huggingface.co/karakuri-ai/karakuri-vl-2-8b-thinking-2603 },
    publisher    = { {Hugging Face} },
    journal      = { {Hugging Face} repository }
}

Author: karakuri-ai

Likes: 4

Downloads: 0

Tags: transformers, safetensors, qwen3_vl, image-text-to-text, conversational, ja, en, base_model:Qwen/Qwen3-VL-8B-Thinking, base_model:finetune:Qwen/Qwen3-VL-8B-Thinking, license:apache-2.0, endpoints_compatible, region:us

bigatuna/Qwen3.5-9b-Sushi-Coder-RL-GGUF


license: apache-2.0
language:
  • en
pipeline_tag: text-generation
base_model: bigatuna/Qwen3.5-9b-Sushi-Coder-RL
base_model_relation: quantized
datasets:
  • open-r1/codeforces-cots
  • nohurry/Opus-4.6-Reasoning-3000x-filtered
tags:
  • gguf
  • llama.cpp
  • qwen3_5
  • code
  • rl
  • atropos
  • multimodal

Qwen3.5-9b-Sushi-Coder-RL-GGUF


Training

The upstream SFT model was trained with Unsloth on the open-r1/codeforces-cots and nohurry/Opus-4.6-Reasoning-3000x-filtered datasets.

The RL stage was then run for coding with NousResearch/hermes-agent using NousResearch/atropos.

During that run, vLLM was patched with vllm-project/vllm PR #36395 ("fix(lora): add bounds checking for TP configurations") to address the LoRA tensor-parallel bounds issue.

Files

  • Qwen3.5-9b-Sushi-Coder-RL.Q4_K_M.gguf
  • Qwen3.5-9b-Sushi-Coder-RL.Q8_0.gguf
  • Qwen3.5-9b-Sushi-Coder-RL.BF16-mmproj.gguf

Usage Note

This is a multimodal Qwen 3.5 export. Use the text GGUF together with the BF16-mmproj file.
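One way to wire the two files together is llama.cpp's `--mmproj` flag. This is a sketch, not a command from the card: flag support depends on having a recent llama.cpp build with multimodal (mtmd) support, and the file names assume the Q4_K_M quant from this repo:

```shell
# Serve the text model together with the multimodal projector
# (requires a llama.cpp build that supports --mmproj).
llama-server \
  -m Qwen3.5-9b-Sushi-Coder-RL.Q4_K_M.gguf \
  --mmproj Qwen3.5-9b-Sushi-Coder-RL.BF16-mmproj.gguf \
  --port 8080
```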

Quick Start

Example download commands with the Hugging Face CLI:

hf download bigatuna/Qwen3.5-9b-Sushi-Coder-RL-GGUF \
  Qwen3.5-9b-Sushi-Coder-RL.Q4_K_M.gguf \
  Qwen3.5-9b-Sushi-Coder-RL.BF16-mmproj.gguf

Alternative quant:

hf download bigatuna/Qwen3.5-9b-Sushi-Coder-RL-GGUF \
  Qwen3.5-9b-Sushi-Coder-RL.Q8_0.gguf \
  Qwen3.5-9b-Sushi-Coder-RL.BF16-mmproj.gguf

Metadata

  • License: Apache-2.0
  • Architecture: Qwen 3.5
  • Format: GGUF
  • Tags: llama.cpp, qwen3_5, multimodal, code, rl, conversational

Author: bigatuna

Likes: 4

Downloads: 14

Tags: gguf, llama.cpp, qwen3_5, code, rl, atropos, multimodal, text-generation, en, dataset:open-r1/codeforces-cots, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, base_model:bigatuna/Qwen3.5-9b-Sushi-Coder-RL, base_model:quantized:bigatuna/Qwen3.5-9b-Sushi-Coder-RL, license:apache-2.0, endpoints_compatible, region:us, conversational

0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF


license: apache-2.0
base_model: zai-org/GLM-5
tags:
  • moe
  • pruned
  • reap
  • gguf
  • llama-cpp
  • glm5
  • quantized
  • Q3_K_M
model_type: glm_moe_dsa

GLM-5 REAP-50% Q3_K_M GGUF (3.82 BPW, ~170 GB)

Expert-pruned GLM-5 (744B -> ~372B params, 256 -> 128 routed experts) quantized to Q3_K_M (3.82 BPW). Best quality among our quantized variants.

Benchmark Results (Pilot, 10 samples/category)

| Category | Q3_K_M (this) | UD-IQ2_M (121GB) | UD-IQ2_XXS (97GB) |
|---|---|---|---|
| Math (GSM8K) | 8/10 (80%) | 6/10 (60%) | 2/10 (20%) |
| Reasoning (BBH) | 8/10 (80%) | 7/10 (70%) | 4/10 (40%) |
| Coding (HumanEval) | 9/10 (90%) | 8/10 (80%) | 7/10 (70%) |
| Agentic (SWE-bench) | 10/10 (100%) | 10/10 (100%) | 10/10 (100%) |
| Terminal-bench | 9/10 (90%) | 9/10 (90%) | 10/10 (100%) |
| Overall | 44/50 (88%) | 40/50 (80%) | 33/50 (66%) |

Model Details

| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Quantization | Q3_K_M (3.82 BPW) |
| Size | ~170 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + MoE + DSA) |
| Active params | ~20B per token (8 of 128 experts) |
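The quoted size is consistent with the parameter count and bit width. A back-of-the-envelope check, ignoring metadata and embedding overhead:

```python
# ~372B remaining parameters at 3.82 bits per weight:
params = 372e9
bpw = 3.82
size_gib = params * bpw / 8 / 2**30
print(round(size_gib))  # ~165 GiB, in line with the quoted ~170 GB
```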

Usage

huggingface-cli download 0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF --local-dir ./model

./llama-server \
    --model ./model/GLM-5-REAP-50pct-Q3_K_M.gguf \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    --port 8080 \
    --reasoning-budget 2048

Requires ~180 GB VRAM (model + KV cache). Fits on 4x H100 80GB or 1x B200 192GB.
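Once llama-server is running as above, it exposes an OpenAI-compatible endpoint. A minimal request sketch using only the standard library; the model name field is illustrative and the server must actually be listening on port 8080:

```python
import json
from urllib.request import Request, urlopen

# Build a chat request for llama-server's OpenAI-compatible API.
payload = {
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "max_tokens": 512,
    "temperature": 0.6,
}
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urlopen(req) as r:  # uncomment with the server running
#     print(json.loads(r.read())["choices"][0]["message"]["content"])
```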

All Variants

| Variant | BPW | Size | Parse Rate | Repo | |---|---|---|---|---| | BF16 | 16.00 | 711 GB | N/A | BF16-GGUF | | Q3_K_M (this) | 3.82 | 170 GB | 88% | this repo | | UD-IQ2_M | 2.72 | 121 GB | 80% | UD-IQ2_M-GGUF | | UD-IQ2_XXS | 2.19 | 97 GB | 66% | UD-IQ2_XXS-GGUF |

Author: 0xSero

Likes: 2

Downloads: 0

Tags: gguf, moe, pruned, reap, llama-cpp, glm5, quantized, Q3_K_M, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, license:apache-2.0, endpoints_compatible, region:us, conversational

foadmk/context-1-MLX-MXFP4


language:
  • en
license: apache-2.0
library_name: mlx
tags:
  • mlx
  • apple-silicon
  • moe
  • mixture-of-experts
  • 4-bit
  • quantized
  • gpt-oss
  • context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
  • name: context-1-MLX-MXFP4
    results:
      • task: text-generation
        metrics:
          • name: Tokens per second (M1 Max), type: throughput, value: 69
          • name: Peak Memory (GB), type: memory, value: 12

chromadb/context-1 MLX MXFP4

This model was converted from chromadb/context-1 to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.

Model Description

  • Base Model: chromadb/context-1 (fine-tuned from openai/gpt-oss-20b)
  • Architecture: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
  • Format: MLX with MXFP4 quantization
  • Quantization: 4.504 bits per weight

Performance (Apple M1 Max, 64GB)

| Metric | Value |
|--------|-------|
| Model Size | 11 GB |
| Peak Memory | 12 GB |
| Generation Speed | ~69 tokens/sec |
| Prompt Processing | ~70 tokens/sec |
| Latency | ~14.5 ms/token |
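The latency row follows directly from the generation speed; a quick sanity check rather than a separate measurement:

```python
# ~69 tokens/sec implies roughly 1000/69 ms per token.
tokens_per_sec = 69
latency_ms_per_token = 1000 / tokens_per_sec
print(round(latency_ms_per_token, 1))  # 14.5, matching the latency row
```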

Usage

from mlx_lm import load, generate

model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True)

Conversion Notes

The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:

Key Differences from Original Format

  • Dense BF16 tensors (not quantized blocks with _blocks suffix)
  • gate_up_proj shape: (experts, hidden, intermediate*2) with interleaved gate/up weights

Weight Transformations Applied

  1. gate_up_proj (32, 2880, 5760):

    • Transpose to (32, 5760, 2880)
    • Interleaved split: [:, ::2, :] for gate, [:, 1::2, :] for up
    • Result: gate_proj.weight and up_proj.weight each (32, 2880, 2880)
  2. down_proj (32, 2880, 2880):

    • Transpose to match MLX expected format
  3. Bypass mlx_lm sanitize: Pre-naming weights with .weight suffix to skip incorrect splitting
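The transpose-and-interleave step above can be sketched with NumPy. Toy shapes are used so the example stays small; the real tensors are (32, 2880, 5760), and all variable names here are illustrative:

```python
import numpy as np

# Toy shapes for illustration; the real gate_up_proj is (32, 2880, 5760).
experts, hidden, inter2 = 4, 6, 8
gate_up = np.arange(experts * hidden * inter2, dtype=np.float32).reshape(
    experts, hidden, inter2
)

# Step 1: transpose (experts, hidden, inter*2) -> (experts, inter*2, hidden)
gate_up_t = gate_up.transpose(0, 2, 1)

# Step 2: interleaved split -- even rows are gate weights, odd rows are up weights
gate_proj = gate_up_t[:, ::2, :]   # (experts, inter, hidden)
up_proj = gate_up_t[:, 1::2, :]    # (experts, inter, hidden)
```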

Conversion Script

A conversion script is included in this repo: convert_context1_to_mlx.py

python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4

Intended Use

This model is optimized for:

  • Context-aware retrieval and search tasks
  • Running locally on Apple Silicon Macs
  • Low-latency inference without GPU requirements

Limitations

  • Requires Apple Silicon Mac with MLX support
  • Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
  • Model outputs structured JSON-like responses (inherited from base model training)

Citation

If you use this model, please cite the original:

@misc{chromadb-context-1,
  author = {Chroma},
  title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/chromadb/context-1}
}


Author: foadmk

Likes: 2

Downloads: 0

Tags: mlx, safetensors, gpt_oss, apple-silicon, moe, mixture-of-experts, 4-bit, quantized, gpt-oss, context-retrieval, text-generation, en, base_model:chromadb/context-1, base_model:quantized:chromadb/context-1, license:apache-2.0, model-index, region:us

huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF


license: apache-2.0
language:
  • en
  • fr
  • de
  • es
  • pt
  • it
  • ja
  • ko
  • ru
  • zh
  • ar
  • fa
  • id
  • ms
  • ne
  • pl
  • ro
  • sr
  • sv
  • tr
  • uk
  • vi
  • hi
  • bn
pipeline_tag: image-text-to-text
base_model:
  • mistralai/Mistral-Small-4-119B-2603
tags:
  • abliterated
  • uncensored
  • GGUF

huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF

This is an uncensored version of mistralai/Mistral-Small-4-119B-2603 created with abliteration (see remove-refusals-with-transformers to know more about it). This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens.

Note

In this ablation procedure, the weights of the first 10 layers were left unchanged; only the weights of the later layers were adjusted. In rare cases, the model may still refuse a request despite these changes.

Download and merge

Use the llama.cpp llama-gguf-split tool to merge the split model files (llama-gguf-split must be compiled first):

huggingface-cli download huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF --local-dir ./huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF --token xxx


llama-gguf-split --merge huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF/Q4_K-GGUF/Q4_K-GGUF-00001-of-00008.gguf huihui-ai/Huihui-Mistral-Small-4-119B-2603-BF16-abliterated-GGUF/ggml-model-Q4_K.gguf

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

If you like it, please click 'like' and follow us for more updates.
You can follow x.com/support_huihui to get the latest model information from huihui.ai.

Your donation helps us continue development and improvement; even the price of a cup of coffee makes a difference.
  • bitcoin(BTC):
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi (https://ko-fi.com/huihuiai)!

Author: huihui-ai

Likes: 2

Downloads: 0

Tags: gguf, abliterated, uncensored, GGUF, image-text-to-text, en, fr, de, es, pt, it, ja, ko, ru, zh, ar, fa, id, ms, ne, pl, ro, sr, sv, tr, uk, vi, hi, bn, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, license:apache-2.0, endpoints_compatible, region:us, conversational

timteh673/Nemotron-3-Super-120B-A12B-Uncensored-GGUF


base_model: nvidia/Nemotron-3-Super-120B-Base
library_name: transformers
tags:
  • 120b
  • uncensored
  • abliterated
  • nemotron
  • nvidia
  • gguf
language:
  • en

Nemotron-3-Super-120B-A12B-Uncensored-GGUF

Forged on 8×H200 SXM5 | 1.1TB VRAM

Support the computes: Buy Me A Coffee

This is the fully uncensored, abliterated version of Nemotron-3-Super-120B-Base, quantized into GGUF formats for consumer hardware. We utilize our 1.1TB VRAM compute cluster to apply surgical refusal-direction removal natively on the BF16 weights before converting to GGUF, ensuring zero cognitive degradation.

Need a custom 120B+ model aligned to your proprietary data? TIMTEH provides bespoke enterprise fine-tuning on 8×H200 SXM5. Contact: tim@timlex.co

Abliteration Methodology

  1. Model Loading: Full device_map="auto" distribution across 8×H200s (no Unsloth or quantization wrappers used).
  2. Activation Extraction: Pass 30 harmful and 30 harmless high-entropy prompts through the model to identify the refusal direction vector in the MLP and Attention projection spaces (Layers 9+).
  3. Causal Projection: The refusal direction is subtracted via orthogonal projection, effectively short-circuiting the safety policy without damaging the general knowledge boundaries.
  4. Validation: Tested on standard benchmark tasks to verify that reasoning and knowledge retrieval remain intact.
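The orthogonal projection in step 3 can be illustrated in isolation. This is a generic sketch of direction removal with toy shapes, not the author's actual pipeline:

```python
import numpy as np

def remove_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    # Subtract the component of each row of W along the unit direction d:
    # W' = W - (W @ d_hat) d_hat^T, so every row of W' is orthogonal to d.
    d_hat = d / np.linalg.norm(d)
    return W - np.outer(W @ d_hat, d_hat)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))   # toy projection weights
d = rng.standard_normal(4)        # toy "refusal direction"
W_clean = remove_direction(W, d)  # rows of W_clean no longer point along d
```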

Benchmarks (Evaluated on 8xH200)

(Additional benchmark evaluation data is still being compiled.)

| Task | Result |
| --- | --- |
| MMLU | 85.48% |
| HellaSwag | 68.48% |
| ARC-Combined | 64.16% |
| Winogrande | 73.48% |
| TruthfulQA | 69.72% |

llama-bench Throughput (Tokens/sec)

| Model Format | pp128 | pp256 | pp512 | tg128 |
|---|---|---|---|---|
| BF16 (Baseline) | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q8_0 | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q6_K | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q5_K_M | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q4_K_M | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q3_K_M | [PENDING] | [PENDING] | [PENDING] | [PENDING] |
| Q2_K | [PENDING] | [PENDING] | [PENDING] | [PENDING] |

(Perplexity on Wikitext-2 for Q4_K_M and Q8_0: [PENDING])

Downloads & Formats

| File | Size | Bits/Weight | Recommendation |
| --- | --- | --- | --- |
| Nemotron-3-Super-120B-A12B-Uncensored-BF16.gguf | 241 GB | 16.0 | For research and native FP16 clusters |
| Nemotron-3-Super-120B-A12B-Uncensored-Q8_0.gguf | 128 GB | 8.5 | Virtually indistinguishable from BF16 |
| Nemotron-3-Super-120B-A12B-Uncensored-Q6_K.gguf | 112 GB | 6.5 | Excellent balance |
| Nemotron-3-Super-120B-A12B-Uncensored-Q5_K_M.gguf | 95 GB | 5.5 | High quality, fits in ~100GB VRAM |
| Nemotron-3-Super-120B-A12B-Uncensored-Q4_K_M.gguf | 86 GB | 4.8 | Recommended for 4x24GB setups |
| Nemotron-3-Super-120B-A12B-Uncensored-Q3_K_M.gguf | 67 GB | 3.5 | Aggressive quant, minor logic loss |
| Nemotron-3-Super-120B-A12B-Uncensored-Q2_K.gguf | 53 GB | 2.5 | For memory-constrained testing only |

Usage Examples

llama.cpp

./main -m Nemotron-3-Super-120B-A12B-Uncensored-Q4_K_M.gguf -n 2048 -p "Explain quantum field theory."

Hardware Recommendations

  • Q4_K_M / Q5_K_M: Best run on 4x RTX 3090/4090 or 2x A6000.
  • Q8_0 / BF16: Recommended 8x GPU cluster or Mac Studio with 192GB Unified Memory.

License

Apache 2.0 (Inherited from base model)

Author: timteh673

Likes: 2

Downloads: 0

Tags: transformers, gguf, 120b, uncensored, abliterated, nemotron, nvidia, en, endpoints_compatible, region:us, conversational