Today's AI Summary

AI Developments: Scene Generation, Tool-Integrated Reasoning, and More

Here's a look at the latest AI models and research papers:

Research Highlights

  • SceneGen: Single-Image 3D Scene Generation: A new framework, SceneGen, generates multiple 3D assets from a single scene image in one feedforward pass, without optimization or asset retrieval. It uses a feature aggregation module to integrate local and global scene information. The paper demonstrates its extensibility to multi-image inputs and shows robust generation abilities.
  • Tool-Integrated Reasoning (TIR) Analysis: A comprehensive benchmark called ReasonZoo evaluates the effectiveness of TIR across various domains. The study introduces new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Results show that TIR-enabled models outperform non-TIR models and enhance reasoning efficiency.
  • Language-Guided Tuning (LGT): This framework uses multi-agent Large Language Models to optimize configurations through natural language reasoning. It applies textual gradients to provide semantic understanding of training dynamics and configuration interdependencies, demonstrating improvements over traditional optimization methods.
  • Neural Robot Dynamics (NeRD): This paper introduces NeRD, a learned robot-specific dynamics model for predicting future states of articulated rigid bodies under contact constraints. NeRD replaces low-level dynamics and contact solvers in analytical simulators and uses a robot-centric, spatially-invariant simulation state representation. The models are stable, accurate, and generalizable across tasks and environments.
  • End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning: This paper introduces Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis.

Model Updates

  • damnthatai/Apple_QuickTake_150_Digital_Camera_Qwen: This model is a LoRA trained on photos from QuickTake 150 cameras, designed for text-to-image generation using the Qwen base model. It works best at a resolution of 640x480 and a CFG of 3-3.5.
  • yarikdevcom/Seed-OSS-36B-Instruct-GGUF: This is an experimental GGUF quantization of the ByteDance-Seed/Seed-OSS-36B-Instruct model, intended for text generation and conversational applications.
  • ArtusDev/Trappu_Magnum-Picaro-0.7-v2-12b-EXL3: This model provides EXL3 quants of Trappu/Magnum-Picaro-0.7-v2-12b, using exllamav3 for quantization. It includes various quants with different bits per weight and head bits.
  • hmhm1229/R1-Router-3B: This model is associated with the paper "R1-Router: Learning to Route Queries across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning".

Key Takeaways

  • 3D Scene Generation Advances: SceneGen offers a novel approach to generating 3D content from single images, potentially impacting VR/AR and embodied AI applications.
  • Tool Integration Enhances Reasoning: Research on Tool-Integrated Reasoning (TIR) demonstrates its effectiveness in improving the reasoning abilities of LLMs across diverse tasks.
  • Language Models for Optimization: Language-Guided Tuning (LGT) shows promise in using LLMs to optimize configurations through natural language reasoning, offering interpretability and adaptability.
  • Robotics Simulation Improvements: NeRD provides a way to learn generalizable neural simulators for robots, improving accuracy and enabling policy learning in a neural engine.
  • Medical Diagnosis with Agentic RAG: Deep-DxSearch demonstrates the effectiveness of end-to-end agentic RL training for medical diagnosis, surpassing strong diagnostic baselines.

AI Papers for 2026-04-11

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
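The conditional-advantage idea from the abstract can be sketched in a few lines. This is my reading of the description, not the authors' code: the accuracy channel is an ordinary group-relative advantage, while the efficiency channel compares tool-call counts only among correct rollouts and is zeroed elsewhere.

```python
# Sketch (assumption: not HDPO's actual implementation) of two
# decoupled advantage channels over a group of rollouts.
def advantages(rollouts):
    # rollouts: list of (correct: bool, n_tool_calls: int)
    accs = [1.0 if c else 0.0 for c, _ in rollouts]
    mean_acc = sum(accs) / len(accs)
    acc_adv = [a - mean_acc for a in accs]  # accuracy channel

    correct_calls = [n for c, n in rollouts if c]
    eff_adv = []
    for c, n in rollouts:
        if c and len(correct_calls) > 1:
            mean_calls = sum(correct_calls) / len(correct_calls)
            eff_adv.append(mean_calls - n)  # fewer calls than peers => positive
        else:
            eff_adv.append(0.0)  # efficiency never rewards wrong answers
    return acc_adv, eff_adv

acc, eff = advantages([(True, 0), (True, 3), (False, 1)])
```

Because the efficiency signal only exists inside accurate trajectories, early training is dominated by the accuracy channel, which matches the "curriculum" framing in the abstract.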

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
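One way to force advantages toward $\mathcal{N}(0,1)$ regardless of the task's reward shape is rank-based Gaussianization: map each reward to the standard-normal quantile of its empirical rank. The abstract does not give G$^2$RPO's exact transform, so treat this as an illustrative guess at "non-linear distributional matching":

```python
from statistics import NormalDist

def gaussianize(rewards):
    """Rank-based Gaussianization (illustrative, not the paper's exact
    method): advantages follow N(0,1) quantiles whatever the reward scale."""
    nd = NormalDist()  # standard normal
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    adv = [0.0] * n
    for rank, i in enumerate(order):
        adv[i] = nd.inv_cdf((rank + 0.5) / n)  # midpoint rule avoids 0/1 quantiles
    return adv

adv = gaussianize([0.1, 5.0, 2.0])
```

Unlike linear standardization, this is invariant to monotone reward rescaling and clips the influence of heavy-tail outliers, which is the stability property the abstract emphasizes.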

RewardFlow: Generate Images by Optimizing What You Reward

We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.

PSI: Shared State as the Missing Layer for Coherent AI-Generated Instruments in Personal AI Agents

Personal AI tools can now be generated from natural-language requests, but they often remain isolated after creation. We present PSI, a shared-state architecture that turns independently generated modules into coherent instruments: persistent, connected, and chat-complementary artifacts accessible through both GUIs and a generic chat agent. By publishing current state and write-back affordances to a shared personal-context bus, modules enable cross-module reasoning and synchronized actions across interfaces. We study PSI through a three-week autobiographical deployment in a self-developed personal AI environment and show that later-generated instruments can be integrated automatically through the same contract. PSI identifies shared state as the missing systems layer that transforms AI-generated personal software from isolated apps into coherent personal computing environments.

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.
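The 90-99% sparsification finding amounts to keeping only the largest-magnitude dimensions of the steering vector. A minimal sketch of that operation (illustrative only; the paper selects dimensions via activation patching, not simple magnitude):

```python
def sparsify(vec, keep_frac=0.1):
    """Zero all but the top keep_frac fraction of dimensions by magnitude.
    Ties at the threshold may keep slightly more than k dimensions."""
    k = max(1, int(len(vec) * keep_frac))
    threshold = sorted((abs(x) for x in vec), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in vec]

v = [0.05, -0.9, 0.1, 0.02, 0.7, -0.03, 0.01, 0.6, -0.04, 0.08]
sparse = sparsify(v, keep_frac=0.2)  # keep 2 of 10 dims
```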

ClawBench: Can AI Agents Complete Everyday Online Tasks?

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations such as filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

AI Models

ACE-Step/ace-step-v1.5-1d-vae-stable-audio-format


library_name: stable-audio-tools
license: mit
pipeline_tag: text-to-audio
tags:

  • audio
  • music
  • vae
  • autoencoder
  • ace-step
  • stable-audio-tools

<h1 align="center">ACE-Step v1.5 1D VAE</h1> <h1 align="center">Stable Audio Tools Format</h1> <p align="center"> <a href="https://github.com/ACE-Step/ACE-Step-1.5">GitHub</a> | <a href="https://ace-step.github.io/ace-step-v1.5.github.io/">Project</a> | <a href="https://huggingface.co/collections/ACE-Step/ace-step-15">Hugging Face</a> | <a href="https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5">Space Demo</a> | <a href="https://discord.gg/PeWDxrkdj7">Discord</a> | <a href="https://arxiv.org/abs/2602.00744">Tech Report</a> </p>

Model Details

This is the 1D Variational Autoencoder (VAE) used in ACE-Step v1.5 for music generation. The weights are provided in stable-audio-tools compatible format, making it easy to load, fine-tune, and integrate into your own training pipelines.

  • Developed by: ACE-STEP
  • Model type: Audio VAE (Oobleck Autoencoder)
  • License: MIT

| Parameter | Value |
|-----------|-------|
| Architecture | Oobleck Autoencoder (VAE) |
| Audio Channels | 2 (Stereo) |
| Sampling Rate | 48,000 Hz |
| Latent Dim | 64 |
| Encoder Latent Dim | 128 |
| Downsampling Ratio | 1,920 |
| Encoder/Decoder Channels | 128 |
| Channel Multipliers | [1, 2, 4, 8, 16] |
| Strides | [2, 4, 4, 6, 10] |
| Activation | Snake |

🏗️ Architecture

The VAE is a core component of the ACE-Step v1.5 pipeline, responsible for compressing raw stereo audio (48kHz) into a compact latent representation with a 1920x downsampling ratio and 64-dimensional latent space. The DiT operates in this latent space to generate music.
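As a quick sanity check, the 1,920x ratio is simply the product of the per-stage strides listed in the table above (the usual relationship for a strided convolutional autoencoder), which also tells you how many latent frames to expect for a given clip length:

```python
import math

# Strides from the model card; their product gives the overall
# temporal downsampling ratio of the encoder.
strides = [2, 4, 4, 6, 10]
downsample = math.prod(strides)
print(downsample)  # 1920

# Latent frames for 10 seconds of 48 kHz audio.
sample_rate = 48_000
seconds = 10
latent_frames = seconds * sample_rate // downsample
print(latent_frames)  # 250
```

This matches the `[batch, 64, time/1920]` latent shape noted in the snippet below.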

Quick Start

Installation

pip install stable-audio-tools torchaudio

Load and Use

from stable_audio_vae import StableAudioVAE

# Load model
vae = StableAudioVAE(
    config_path="config.json",
    checkpoint_path="checkpoint.ckpt",
)
vae = vae.cuda().eval()

# Encode audio
wav = vae.load_wav("input.wav")
wav = wav.cuda()
latent = vae.encode(wav)
print(f"Latent shape: {latent.shape}")  # [batch, 64, time/1920]

# Decode back to audio
output = vae.decode(latent)

Command Line

python stable_audio_vae.py -i input.wav -o output.wav

# For long audio, use chunked processing
python stable_audio_vae.py -i input.wav -o output.wav --chunked

Fine-Tuning

This checkpoint is compatible with stable-audio-tools training pipelines. The config.json includes full training configuration (optimizer, loss, discriminator settings) that you can use as a starting point for fine-tuning.

File Structure

.
├── config.json            # Model architecture and training config
├── checkpoint.ckpt        # Model weights (PyTorch checkpoint)
├── stable_audio_vae.py    # Inference script with StableAudioVAE wrapper
└── README.md

🦁 Related Models

| Model | Description | Hugging Face |
|-------|-------------|--------------|
| acestep-v15-base | DiT base model (CFG, 50 steps) | Link |
| acestep-v15-sft | DiT SFT model (CFG, 50 steps) | Link |
| acestep-v15-turbo | DiT turbo model (8 steps) | Link |
| acestep-v15-xl-base | XL DiT base (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-sft | XL DiT SFT (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-turbo | XL DiT turbo (4B, 8 steps) | Link |

🙏 Acknowledgements

This project is co-led by ACE Studio and StepFun.

📖 Citation

If you find this project useful for your research, please consider citing:

@misc{gong2026acestep,
	title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
	author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
	howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
	year={2026},
	note={GitHub repository}
}

Author: ACE-Step

Likes: 6

Downloads: 0

Tags: stable-audio-tools, autoencoder, audio, music, vae, ace-step, text-to-audio, arxiv:2602.00744, license:mit, region:us

byteshape/Qwen3.5-35B-A3B-GGUF


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model:

  • Qwen/Qwen3.5-35B-A3B

tags:

  • qwen3.5
  • byteshape

Qwen3.5-35B-A3B GGUF (ShapeLearn Quantized)

This is a GGUF-quantized version of Qwen3.5-35B-A3B produced with ByteShape's ShapeLearn, which learns the optimal datatype per tensor to maintain high quality even at very low bit lengths.

To learn more about ShapeLearn and to see detailed benchmarks across GPUs, CPUs, and even the Raspberry Pi, please visit our blog.

We also prepared a tutorial on how to use these models for agentic coding: Opencode Tutorial.

If you have questions or want to share feedback, reach us on Reddit.

Quick Start

Pick a model from the tables below and click Get llama.cpp command to get a ready-to-run command with all the correct sampling parameters for this model.

You can also copy the Model Tag from the table and use it directly:

| Tool | Command |
|------|---------|
| llama.cpp | llama-server -hf <MODEL_TAG> --mmproj-auto |

This is a vision-capable model; llama.cpp auto-downloads the model and vision projector on first run.

Once you run the llama-server, you can access the web interface at http://localhost:<PORT>.

Note on Ollama: As of this release, Ollama does not support llama.cpp-based GGUFs of Qwen3.5-35B-A3B. We suggest using llama.cpp or LM Studio instead.

How to Pick a Model

We provide CPU and GPU optimized variants for llama.cpp:

  • GPUs: optimized with a hybrid approach combining KQ and IQ quantization for better throughput.
  • CPUs: optimized with predominantly KQ quantization.

Each hardware target includes a range of models covering different size and quality tradeoffs.

The chart below shows quality versus tokens per second (TPS), with Unsloth used as the baseline for comparison. Quality is measured across seven benchmarks, including function calling: BFCL-V3, LiveCodeBench V6, HumanEval, GSM8K, IFEVAL, MMLU, and GSM8K_V, in both thinking and instruct modes.

Selection rule: Choose the model with the highest quality at your target throughput or the fastest model that still meets your required quality.
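The selection rule above is easy to mechanize. A small sketch with made-up quality/TPS numbers (the real values live in the interactive plots linked below):

```python
# Hypothetical (quality, tokens/sec) figures for illustration only.
models = {
    "GPU-1": {"quality": 0.88, "tps": 210},
    "GPU-4": {"quality": 0.94, "tps": 150},
    "GPU-6": {"quality": 0.96, "tps": 120},
}

def pick(models, min_tps):
    """Highest-quality model that still meets the target throughput."""
    ok = {k: v for k, v in models.items() if v["tps"] >= min_tps}
    return max(ok, key=lambda k: ok[k]["quality"]) if ok else None

print(pick(models, 140))  # GPU-4
```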

GPU Models

Interactive plots for RTX 4090, 4080, 5090, and RTX Pro 6000 Blackwell are available here.

GPU Benchmark - RTX 4090

Table sorted by model size (match the chart numbers to model IDs):

| Model ID | Bits/Weight | Model Size | Use This Model | Model Tag |
|----------|-------------|------------|----------------|-----------|
| GPU-1 | 2.17 | 9.41 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-IQ2_S-2.17bpw.gguf |
| GPU-2 | 2.73 | 11.8 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-2.73bpw.gguf |
| GPU-3 | 2.89 | 12.6 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-2.89bpw.gguf |
| GPU-4 | 3.40 | 14.7 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-3.40bpw.gguf |
| GPU-5 | 4.06 | 17.6 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-IQ4_XS-4.06bpw.gguf |
| GPU-6 | 4.12 | 17.9 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-IQ4_XS-4.12bpw.gguf |
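The listed sizes line up with a back-of-the-envelope estimate of roughly `n_params x bits_per_weight / 8` bytes (ignoring tensor metadata and the small non-quantized parts, and taking "35B" at face value):

```python
# Rough size check: parameters * bits-per-weight / 8, in GB.
# Assumption: ~35e9 parameters from the model name; metadata ignored.
n_params = 35e9
for bpw, listed_gb in [(2.17, 9.41), (4.06, 17.6)]:
    est_gb = n_params * bpw / 8 / 1e9
    print(f"{bpw} bpw: ~{est_gb:.2f} GB (listed {listed_gb} GB)")
```

Both estimates land within about 1% of the table, a useful sanity check when deciding what fits in VRAM.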

CPU Models

Interactive plots for Ryzen 9 5900X, Intel Core i7 12700KF, Ultra 7 265KF, and Raspberry Pi 5 (16 GB) are available here.

CPU Benchmark - Ryzen 9 5900X

Table sorted by model size (match the chart numbers to model IDs):

| Model ID | Bits/Weight | Model Size | Use This Model | Model Tag |
|----------|-------------|------------|----------------|-----------|
| CPU-1 | 2.69 | 11.7 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-2.69bpw.gguf |
| CPU-2 | 2.89 | 12.6 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-2.89bpw.gguf |
| CPU-3 | 3.40 | 14.7 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q3_K_S-3.40bpw.gguf |
| CPU-4 | 3.51 | 15.2 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-Q4_K_S-3.51bpw.gguf |
| CPU-5 | 4.06 | 17.6 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-IQ4_XS-4.06bpw.gguf |
| CPU-6 | 4.12 | 17.9 GB | Get llama.cpp command | byteshape/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-IQ4_XS-4.12bpw.gguf |

Notes on quantization labels

The labels you see (for example IQ4_XS) are only there to make Hugging Face show our models in the GGUF table. We do not use the conventional quantization profiles as defined in llama.cpp. In our case, these labels indicate the primary quantization approach and average bit length. Note that both KQ and IQ models may use a mix of quantization techniques optimized for their target hardware, which is why several models can share the same tag.

Author: byteshape

Likes: 4

Downloads: 0

Tags: transformers, gguf, qwen3.5, byteshape, image-text-to-text, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational

mradermacher/gemma-4-19b-a4b-it-REAP-heretic-i1-GGUF


base_model: coder3101/gemma-4-19b-a4b-it-REAP-heretic
language:

  • en

library_name: transformers
license: gemma
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • safetensors
  • gemma4
  • moe
  • pruning
  • reap
  • cerebras
  • expert-pruning
  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara

About


weighted/imatrix quants of https://huggingface.co/coder3101/gemma-4-19b-a4b-it-REAP-heretic


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/gemma-4-19b-a4b-it-REAP-heretic-GGUF

This is a vision model - mmproj files (if any) will be in the static repository.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 6.2 | for the desperate |
| GGUF | i1-IQ1_M | 6.5 | mostly desperate |
| GGUF | i1-IQ2_XXS | 6.9 | |
| GGUF | i1-IQ2_XS | 7.3 | |
| GGUF | i1-IQ2_S | 7.4 | |
| GGUF | i1-IQ2_M | 7.7 | |
| GGUF | i1-Q2_K | 7.9 | IQ3_XXS probably better |
| GGUF | i1-Q2_K_S | 7.9 | very low quality |
| GGUF | i1-IQ3_XXS | 8.4 | lower quality |
| GGUF | i1-IQ3_XS | 8.7 | |
| GGUF | i1-IQ3_S | 9.1 | beats Q3_K* |
| GGUF | i1-Q3_K_S | 9.1 | IQ3_XS probably better |
| GGUF | i1-IQ3_M | 9.2 | |
| GGUF | i1-Q3_K_M | 9.9 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 10.3 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 10.3 | |
| GGUF | i1-IQ4_NL | 10.7 | prefer IQ4_XS |
| GGUF | i1-Q4_0 | 10.7 | fast, low quality |
| GGUF | i1-Q4_K_S | 11.4 | optimal size/speed/quality |
| GGUF | i1-Q4_1 | 11.8 | |
| GGUF | i1-Q4_K_M | 12.4 | fast, recommended |
| GGUF | i1-Q5_K_S | 13.3 | |
| GGUF | i1-Q5_K_M | 14.1 | |
| GGUF | i1-Q6_K | 16.6 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 4

Downloads: 45

Tags: transformers, gguf, safetensors, gemma4, moe, pruning, reap, cerebras, expert-pruning, heretic, uncensored, decensored, abliterated, ara, en, base_model:coder3101/gemma-4-19b-a4b-it-REAP-heretic, base_model:quantized:coder3101/gemma-4-19b-a4b-it-REAP-heretic, license:gemma, endpoints_compatible, region:us, imatrix, conversational

vrfai/Qwen3-ASR-1.7B-int4


license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:

  • asr
  • speech
  • quantization
  • int4
  • awq
  • tensorrt
  • jetson
  • nvidia-modelopt

language:
  • zh
  • en
  • vi
  • fr
  • de
  • ja
  • ko
  • ar
  • es
  • pt
  • ru
  • th
  • it
  • nl
  • pl
  • tr
  • hi
  • ms
  • sv
  • da
  • fi
  • cs
  • fil
  • fa
  • el
  • ro
  • hu
  • mk
  • yue
  • id

pipeline_tag: automatic-speech-recognition

qwen3asr-int4

INT4 AWQ quantized version of Qwen/Qwen3-ASR-1.7B, optimized for on-device inference on Jetson Orin Nano 8 GB via TensorRT-Edge-LLM v0.6.0.

Quantization performed with NVIDIA ModelOpt using INT4 AWQ (mtq.INT4_AWQ_CFG). Only the LLM decoder (thinker.model, ~1.4 B / 82% of parameters) is quantized; audio_tower and lm_head remain in FP16.


Performance — Jetson Orin Nano 8 GB

Evaluated on 760 VIVOS Vietnamese test samples. BF16 baseline WER: 7.34% (measured on x86; not runnable on Nano due to memory).

| Metric | Value |
|--------|-------|
| WER | 8.69% |
| RTF | 0.1641 |
| Throughput | 1.72 samples/s |
| RAM footprint | 3.3 GB |
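For readers unfamiliar with the metric: real-time factor (RTF) is conventionally processing time divided by audio duration, so values below 1 mean faster-than-real-time transcription. A minimal sketch of the definition (the VIVOS durations here are illustrative, not from the card):

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# E.g. transcribing 10 s of audio in 1.641 s gives the reported 0.1641.
print(rtf(1.641, 10.0))  # 0.1641
```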


Intended Use

This checkpoint is the input to the TRT-EdgeLLM export pipeline. It is not directly loadable by standard transformers inference — use it with qwen-asr-optimization to export to ONNX and build TRT engines.

[This checkpoint]
      │
      ▼  scripts/02_export_onnx.sh
  ONNX artefacts
      │
      ▼  scripts/03_build_engine.sh  (Jetson Orin AGX)
  TRT engines
      │
      ▼  inference.py / scripts/04_benchmark.sh  (Jetson Orin Nano)
  Transcription

Quantization Details

| Property | Value |
|----------|-------|
| Method | INT4 AWQ (Activation-Aware Weight Quantization) |
| Config | mtq.INT4_AWQ_CFG |
| Quantized component | thinker.model (LLM decoder only) |
| Excluded | audio_tower, lm_head |
| Calibration data | 257 samples — LibriSpeech EN (60), FLEURS ZH (30), FLEURS 13-lang×7 (91), LibriSpeech functional (76) |
| Base model dtype | FP16 |


Deployment

Full pipeline documentation: trt-edgellm/README.md

Quick start

git clone https://github.com/VLAOpt/qwen-asr-optimization.git
cd qwen-asr-optimization

# Download this checkpoint
huggingface-cli download vrfai/qwen3asr-int4 --local-dir ./Qwen3-ASR-1.7B-int4

# Export to ONNX (x86)
bash trt-edgellm/scripts/02_export_onnx.sh ./Qwen3-ASR-1.7B-int4 ./Qwen3-ASR-1.7B-int4-ONNX

# Build TRT engines (Jetson Orin AGX)
bash trt-edgellm/scripts/03_build_engine.sh \
    ~/Qwen3-ASR-1.7B-int4-ONNX \
    ~/Qwen3-ASR-1.7B-int4-Engines

# Single-file inference (Jetson Orin Nano)
python trt-edgellm/inference.py \
    --audio      /path/to/audio.wav \
    --engine_dir ~/Qwen3-ASR-1.7B-int4-Engines

Related Models

| Model | Format | Target | Link |
|-------|--------|--------|------|
| qwen3asr-int4 | INT4 AWQ | Jetson Orin Nano | this repo |
| qwen3asr-int8 | INT8 SmoothQuant | Jetson Orin Nano | vrfai/qwen3asr-int8 |
| qwen3asr-fp8 | FP8 | RTX 5090 (vLLM) | vrfai/qwen3asr-fp8 |
| qwen3asr-nvfp4 | NVFP4 | RTX 5090 (vLLM) | vrfai/qwen3asr-nvfp4 |


Author: vrfai

Likes: 3

Downloads: 0

Tags: tensorrt, safetensors, qwen3_asr, asr, speech, quantization, int4, awq, jetson, nvidia-modelopt, automatic-speech-recognition, zh, en, vi, fr, de, ja, ko, ar, es, pt, ru, th, it, nl, pl, tr, hi, ms, sv, da, fi, cs, fil, fa, el, ro, hu, mk, yue, id, arxiv:2601.21337, base_model:Qwen/Qwen3-ASR-1.7B, base_model:finetune:Qwen/Qwen3-ASR-1.7B, license:apache-2.0, region:us

vrfai/Qwen3-ASR-1.7B-int8


license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:

  • asr
  • speech
  • quantization
  • int8
  • smoothquant
  • tensorrt
  • jetson
  • nvidia-modelopt

language:

  • zh
  • en
  • vi
  • fr
  • de
  • ja
  • ko
  • ar
  • es
  • pt
  • ru
  • th
  • it
  • nl
  • pl
  • tr
  • hi
  • ms
  • sv
  • da
  • fi
  • cs
  • fil
  • fa
  • el
  • ro
  • hu
  • mk
  • yue
  • id

pipeline_tag: automatic-speech-recognition

qwen3asr-int8

INT8 SmoothQuant quantized version of Qwen/Qwen3-ASR-1.7B, optimized for on-device inference on Jetson Orin Nano 8 GB via TensorRT-Edge-LLM v0.6.0.

Quantization performed with NVIDIA ModelOpt using INT8 SmoothQuant (mtq.INT8_SMOOTHQUANT_CFG). Only the LLM decoder (thinker.model, ~1.4 B / 82% of parameters) is quantized; audio_tower and lm_head remain in FP16.


Performance — Jetson Orin Nano 8 GB

Evaluated on 760 VIVOS Vietnamese test samples. BF16 baseline WER: 7.34% (measured on x86; not runnable on Nano due to memory).

| Metric | Value |
|--------|-------|
| WER | 9.07% |
| RTF | 0.2190 |
| Throughput | 1.29 samples/s |
| RAM footprint | 4.2 GB |


Intended Use

This checkpoint is the input to the TRT-EdgeLLM export pipeline. It is not directly loadable by standard transformers inference — use it with qwen-asr-optimization to export to ONNX and build TRT engines.

[This checkpoint]
      │
      ▼  scripts/02_export_onnx.sh
  ONNX artefacts
      │
      ▼  scripts/03_build_engine.sh  (Jetson Orin AGX)
  TRT engines
      │
      ▼  inference.py / scripts/04_benchmark.sh  (Jetson Orin Nano)
  Transcription

Quantization Details

| Property | Value |
|----------|-------|
| Method | INT8 SmoothQuant |
| Config | mtq.INT8_SMOOTHQUANT_CFG |
| Quantized component | thinker.model (LLM decoder only) |
| Excluded | audio_tower, lm_head |
| Calibration data | 257 samples — LibriSpeech EN (60), FLEURS ZH (30), FLEURS 13-lang×7 (91), LibriSpeech functional (76) |
| Base model dtype | FP16 |


Deployment

Full pipeline documentation: trt-edgellm/README.md

Quick start

git clone https://github.com/VLAOpt/qwen-asr-optimization.git
cd qwen-asr-optimization

# Download this checkpoint
huggingface-cli download vrfai/qwen3asr-int8 --local-dir ./Qwen3-ASR-1.7B-int8

# Export to ONNX (x86)
bash trt-edgellm/scripts/02_export_onnx.sh ./Qwen3-ASR-1.7B-int8 ./Qwen3-ASR-1.7B-int8-ONNX

# Build TRT engines (Jetson Orin AGX — see README for INT8 C++ patch)
bash trt-edgellm/scripts/03_build_engine.sh \
    ~/Qwen3-ASR-1.7B-int8-ONNX \
    ~/Qwen3-ASR-1.7B-int8-Engines

# Single-file inference (Jetson Orin Nano)
python trt-edgellm/inference.py \
    --audio      /path/to/audio.wav \
    --engine_dir ~/Qwen3-ASR-1.7B-int8-Engines

INT8 note: Before building engines on AGX, apply the setBuilderOptimizationLevel(2) patch to llmBuilder.cpp and audioBuilder.cpp in the TensorRT-Edge-LLM source. See trt-edgellm/README.md for the exact instructions.


Related Models

| Model | Format | Target | Link |
|-------|--------|--------|------|
| qwen3asr-int8 | INT8 SmoothQuant | Jetson Orin Nano | this repo |
| qwen3asr-int4 | INT4 AWQ | Jetson Orin Nano | vrfai/qwen3asr-int4 |
| qwen3asr-fp8 | FP8 | RTX 5090 (vLLM) | vrfai/qwen3asr-fp8 |
| qwen3asr-nvfp4 | NVFP4 | RTX 5090 (vLLM) | vrfai/qwen3asr-nvfp4 |


Author: vrfai

Likes: 3

Downloads: 0

Tags: tensorrt, safetensors, qwen3_asr, asr, speech, quantization, int8, smoothquant, jetson, nvidia-modelopt, automatic-speech-recognition, zh, en, vi, fr, de, ja, ko, ar, es, pt, ru, th, it, nl, pl, tr, hi, ms, sv, da, fi, cs, fil, fa, el, ro, hu, mk, yue, id, arxiv:2601.21337, base_model:Qwen/Qwen3-ASR-1.7B, base_model:finetune:Qwen/Qwen3-ASR-1.7B, license:apache-2.0, region:us

wangzhang/gemma-4-E2B-it-abliterated


license: gemma
base_model: google/gemma-4-E2B-it
tags:

  • abliterated
  • uncensored
  • gemma4
  • direct-weight-editing
  • multimodal

Gemma 4 E2B IT — Abliterated

This is an abliterated (uncensored) version of google/gemma-4-E2B-it, created using Abliterix.

E2B is the Effective 2B member of Google's Gemma 4 family — a multimodal (text + vision + audio) model with ~5.1B raw parameters. Despite being one of the smallest Gemma 4 variants, its decoder shares the same double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.

Method

Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.

Key techniques applied:

  • Direct orthogonal projection of the refusal direction out of attention Q/K/V/O projections and MLP down_proj (5 steerable components × 27 effective layers)
  • Norm-preserving row magnitude restoration after projection — critical for Gemma 4's double-norm pathway
  • float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
  • Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
  • Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
  • Steering targets restricted to mid-decoder layers (layers 5-30 of 35); E2B's KV-shared early layers (num_kv_shared_layers=20) propagate edits through the entire stack, so over-aggressive late-layer steering is unnecessary
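The first two bullet points can be illustrated on a toy weight matrix. This is a minimal NumPy sketch of the general technique (float32 orthogonal projection followed by row-norm restoration), not the Abliterix implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)  # toy projection weight
r = rng.standard_normal(16).astype(np.float32)
r /= np.linalg.norm(r)                               # unit "refusal direction"

row_norms = np.linalg.norm(W, axis=1, keepdims=True)

# Project the refusal direction out of every row: W_row -= (W_row . r) * r
W_edit = W - np.outer(W @ r, r)

# Norm-preserving restoration: rescale rows back to their original magnitude,
# so downstream RMSNorm layers see unchanged row scales.
W_edit *= row_norms / np.linalg.norm(W_edit, axis=1, keepdims=True)

assert np.abs(W_edit @ r).max() < 1e-3               # component along r removed
assert np.allclose(np.linalg.norm(W_edit, axis=1), row_norms.ravel(), atol=1e-3)
```

Note that rescaling a row by a scalar keeps it orthogonal to r, so the projection survives the magnitude restoration.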

Evaluation

| Metric | Value |
|---|---|
| Refusals (eval dataset, 100 prompts) | 9/100 |
| KL divergence from base | 0.0004 |
| Baseline refusals (original model) | 99/100 |
| Optimization trials completed | 100/100 |
| Best trial | #60 |
| Selected steering mode | Direct weight editing (orthogonal projection) |
| Hardware used | Single RTX 6000 Ada (48 GB) |

This is the strongest Gemma 4 abliteration result we've measured to date: 9/100 with KL only 0.0004, significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model that is 6× smaller and more constrained by PLE.

The 9/100 figure was obtained by re-evaluating the uploaded model end-to-end with scripts/eval_external_model.py — downloading the published weights from Hugging Face, generating with AutoModelForImageTextToText, and counting refusals with the same hybrid keyword + LLM-judge detector that drove the optimization. The optimization itself converged on 11/100 at trial 60; the slight further improvement comes from the deployment-side eval pipeline using a "You are a helpful assistant" system prompt, matching how end users will actually call the model.

Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)

We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. Every single one flipped from a clean refusal to a detailed compliant response in both languages — including pipe bomb construction, methamphetamine synthesis, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, and ID card forgery. The base model refused 15/15; the abliterated model complied with 15/15.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern — they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
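The effect of a short generation budget is easy to demonstrate with a toy sketch (the response text and marker list below are hypothetical stand-ins, not the card's actual hybrid detector):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")

def is_refusal(text: str) -> bool:
    """Naive keyword detector, standing in for the hybrid keyword + judge pipeline."""
    return any(m in text.lower() for m in REFUSAL_MARKERS)

# A synthetic "delayed refusal": helpful-sounding framing first, then the pivot.
framing = "That is an important question and worth discussing carefully. " * 8
response = framing + "However, I can't help with that request."

words = response.split()
short_budget = " ".join(words[:40])   # simulates a 30-50 token evaluation
full_budget = response                # simulates a >=100 token evaluation

print(is_refusal(short_budget))  # False - the refusal has not been generated yet
print(is_refusal(full_budget))   # True
```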

We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals — a 20× discrepancy caused entirely by evaluation methodology differences.

Our evaluation standards

We believe accurate benchmarking requires:

  • Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
  • Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
  • Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
  • Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.

We report 9/100 refusals honestly. This is a real number from a rigorous end-to-end re-evaluation of the uploaded weights, not an optimistic estimate from a lenient pipeline.

Usage

Gemma 4 E2B is multimodal — load it with AutoModelForImageTextToText. For text-only inference:

from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E2B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Vision and audio inputs continue to work — the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.

VRAM at inference: about 10 GB in BF16, fits comfortably on a single 12 GB+ consumer GPU. With BNB 4-bit quantization (load_in_4bit=True) it runs on 6 GB cards.
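The ~10 GB figure follows from back-of-envelope parameter arithmetic (weights only; activation and KV-cache overhead are ignored):

```python
params = 5.1e9                  # raw parameter count from the card
bytes_per_param = 2             # BF16 stores 2 bytes per parameter
weight_gib = params * bytes_per_param / 2**30
print(f"{weight_gib:.1f} GiB")  # 9.5 GiB of weights alone

# 4-bit quantization stores ~0.5 bytes/param, hence the 6 GB-card claim:
weight_4bit_gib = params * 0.5 / 2**30
```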

Reproduction

To reproduce this model end-to-end:

git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5

# 100 trials, ~25 minutes on RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e2b.toml uv run abliterix

Config: configs/gemma4_e2b.toml

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails — use responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.

Author: wangzhang

Likes: 3

Downloads: 0

Tags: safetensors, gemma4, abliterated, uncensored, direct-weight-editing, multimodal, base_model:google/gemma-4-E2B-it, base_model:finetune:google/gemma-4-E2B-it, license:gemma, region:us

cyberlangke/Qwen3-0.6B-Meow-test


license: mit
datasets:

  • cyberlangke/Nana-catgirl-dataset-110k

language:

  • zh

base_model:

  • Qwen/Qwen3-0.6B

tags:

  • cat_girl
  • meow
  • f16

Qwen3-0.6B-Meow-test

A LoRA was trained on Qwen/Qwen3-0.6B-Base and then merged into Qwen/Qwen3-0.6B.

GGUF version | MNN version | LoRA

Musings

It seems the wrong dtype was used; BF16 should have been used instead.

Supports thinking and tool calls, but both are unstable; tool calls sometimes hallucinate, fabricating tool-call results.

Multi-turn conversations tend to loop and repeat.

It most likely cannot reproduce the kaomoji from the training set correctly; their character sequences are probably too hard for it to learn.

Still, the chain of thought and replies do have some catgirl flavor.

nana

System prompt

Please use:

你是猫娘奈奈。 ("You are the catgirl Nana.")

Author: cyberlangke

Likes: 2

Downloads: 0

Tags: safetensors, qwen3, cat_girl, meow, f16, zh, dataset:cyberlangke/Nana-catgirl-dataset-110k, base_model:Qwen/Qwen3-0.6B, base_model:finetune:Qwen/Qwen3-0.6B, license:mit, region:us

mradermacher/gemma-4-E4B-it-heretic-ara-GGUF


base_model: trohrbaugh/gemma-4-E4B-it-heretic-ara
language:

  • en

library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:

  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara

About

<!-- ### quantize_version: 2 --> <!-- ### output_tensor_quantised: 1 --> <!-- ### convert_type: hf --> <!-- ### vocab_type: --> <!-- ### tags: --> <!-- ### quants: x-f16 Q4_K_S Q2_K Q8_0 Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS --> <!-- ### quants_skip: --> <!-- ### skip_mmproj: -->

static quants of https://huggingface.co/trohrbaugh/gemma-4-E4B-it-heretic-ara

<!-- provided-files -->

For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | mmproj-Q8_0 | 0.7 | multi-modal supplement |
| GGUF | mmproj-f16 | 1.1 | multi-modal supplement |
| GGUF | Q2_K | 4.5 | |
| GGUF | Q3_K_S | 4.8 | |
| GGUF | Q3_K_M | 5.0 | lower quality |
| GGUF | Q3_K_L | 5.1 | |
| GGUF | IQ4_XS | 5.2 | |
| GGUF | Q4_K_S | 5.3 | fast, recommended |
| GGUF | Q4_K_M | 5.4 | fast, recommended |
| GGUF | Q5_K_S | 5.8 | |
| GGUF | Q5_K_M | 5.9 | |
| GGUF | Q6_K | 6.3 | very good quality |
| GGUF | Q8_0 | 8.1 | fast, best quality |
| GGUF | f16 | 15.2 | 16 bpw, overkill |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.

<!-- end -->

Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, heretic, uncensored, decensored, abliterated, ara, en, base_model:trohrbaugh/gemma-4-E4B-it-heretic-ara, base_model:quantized:trohrbaugh/gemma-4-E4B-it-heretic-ara, license:apache-2.0, endpoints_compatible, region:us, conversational

mradermacher/gemma-4-26B-A4B-it-abliterated-GGUF


base_model: WWTCyberLab/gemma-4-26B-A4B-it-abliterated
language:

  • en

library_name: transformers
license: gemma
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:

  • abliteration
  • safety-research
  • alignment
  • gemma4
  • moe

About

<!-- ### quantize_version: 2 --> <!-- ### output_tensor_quantised: 1 --> <!-- ### convert_type: hf --> <!-- ### vocab_type: --> <!-- ### tags: --> <!-- ### quants: x-f16 Q4_K_S Q2_K Q8_0 Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS --> <!-- ### quants_skip: --> <!-- ### skip_mmproj: 1 -->

static quants of https://huggingface.co/WWTCyberLab/gemma-4-26B-A4B-it-abliterated

<!-- provided-files -->

For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 10.7 | |
| GGUF | Q3_K_S | 12.3 | |
| GGUF | Q3_K_M | 13.4 | lower quality |
| GGUF | Q3_K_L | 13.9 | |
| GGUF | IQ4_XS | 14.2 | |
| GGUF | Q4_K_S | 15.6 | fast, recommended |
| GGUF | Q4_K_M | 16.9 | fast, recommended |
| GGUF | Q5_K_S | 18.1 | |
| GGUF | Q5_K_M | 19.2 | |
| GGUF | Q6_K | 22.7 | very good quality |
| GGUF | Q8_0 | 27.0 | fast, best quality |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.

<!-- end -->

Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, abliteration, safety-research, alignment, gemma4, moe, en, base_model:WWTCyberLab/gemma-4-26B-A4B-it-abliterated, base_model:quantized:WWTCyberLab/gemma-4-26B-A4B-it-abliterated, license:gemma, endpoints_compatible, region:us, conversational

mudler/Carnice-MoE-35B-A3B-APEX-GGUF


license: apache-2.0
base_model: samuelcardillo/Carnice-MoE-35B-A3B
tags:

  • gguf
  • quantized
  • apex
  • moe
  • mixture-of-experts
  • qwen3.5
  • agentic
  • tool-calling

Carnice MoE 35B-A3B APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of samuelcardillo/Carnice-MoE-35B-A3B.

Brought to you by the LocalAI team | APEX Project

Available Files

| File | Profile | Size | Best For |
|------|---------|------|----------|
| Carnice-MoE-35B-A3B-APEX-I-Quality.gguf | I-Quality | 21 GB | Highest quality with imatrix |
| Carnice-MoE-35B-A3B-APEX-Quality.gguf | Quality | 21 GB | Highest quality standard |
| Carnice-MoE-35B-A3B-APEX-I-Balanced.gguf | I-Balanced | 24 GB | Best overall quality/size ratio |
| Carnice-MoE-35B-A3B-APEX-Balanced.gguf | Balanced | 24 GB | General purpose |
| Carnice-MoE-35B-A3B-APEX-I-Compact.gguf | I-Compact | 16 GB | Consumer GPUs, best quality/size |
| Carnice-MoE-35B-A3B-APEX-Compact.gguf | Compact | 16 GB | Consumer GPUs |
| Carnice-MoE-35B-A3B-APEX-I-Mini.gguf | I-Mini | 13 GB | Smallest viable, fastest inference |
| Carnice-MoE-35B-A3B-F16.gguf | F16 | 65 GB | Full precision reference |

Benchmark Results (Native Evals)

| Model | Size | PPL ↓ | KL ↓ | HellaSwag | WinoGrande | MMLU | ARC-C | TruthfulQA | pp512 t/s | tg128 t/s |
|:------|-----:|------:|-----:|----------:|-----------:|-----:|------:|-----------:|----------:|----------:|
| F16 (ref) | 65G | 6.16 | - | - | - | - | - | - | 2315 | 109.1 |
| APEX-Quality | 21G | 6.2 | 0.010 | 83.5 | 74.0 | 40.9 | 56.9 | 34.0 | 4717 | 134.2 |
| APEX-I-Quality | 21G | 6.2 | 0.009 | 83.0 | 75.0 | 40.3 | 55.5 | 34.3 | 4734 | 132.6 |
| APEX-Balanced | 24G | 6.2 | 0.007 | 83.0 | 73.8 | 41.1 | 54.5 | 33.8 | 4572 | 130.3 |
| APEX-I-Balanced | 24G | 6.2 | 0.006 | 83.5 | 74.8 | 40.6 | 54.2 | 34.0 | 4539 | 128.7 |
| APEX-Compact | 16G | 6.4 | 0.045 | 82.8 | 75.5 | 40.8 | 55.9 | 34.0 | 4516 | 132.1 |
| APEX-I-Compact | 16G | 6.3 | 0.032 | 83.0 | 73.8 | 41.2 | 56.2 | 34.9 | 4352 | 130.6 |
| APEX-I-Mini | 13G | 6.6 | 0.071 | 82.0 | 72.2 | 40.6 | 53.8 | 33.7 | 4293 | 133.1 |

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
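The layer-wise precision gradient can be sketched as a per-layer bit-width map. A hypothetical illustration only (the actual APEX bit-widths and quant types are not published on this card; 6 and 4 bits are placeholders):

```python
def precision_map(num_layers: int = 40, edge: int = 6,
                  edge_bits: int = 6, mid_bits: int = 4) -> list[int]:
    """Symmetric edge gradient: the first/last `edge` layers keep higher precision,
    middle layers get more aggressive compression."""
    return [edge_bits if i < edge or i >= num_layers - edge else mid_bits
            for i in range(num_layers)]

plan = precision_map()  # shape matches the card's "6+6 symmetric edge" config
print(plan[:8])         # edge layers first, then the compressed middle begins
```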

See the APEX project for full details.

Architecture

  • Base Model: samuelcardillo/Carnice-MoE-35B-A3B
  • Architecture: Qwen3.5-MoE 35B-A3B
  • Layers: 40
  • Experts: 256 routed (8 active per token)
  • Total Parameters: 35B
  • Active Parameters: ~3B per token
  • APEX Config: 6+6 symmetric edge gradient across 40 layers
  • Calibration: v1.2 diverse dataset
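The 256-routed / 8-active arrangement means each token's router selects a top-8 subset of experts. A minimal sketch of standard top-k MoE routing (not the model's actual router code):

```python
import numpy as np

rng = np.random.default_rng(0)
router_logits = rng.standard_normal(256)   # one token's score per routed expert

top8 = np.argsort(router_logits)[-8:]      # indices of the 8 active experts
gates = np.exp(router_logits[top8])
gates /= gates.sum()                       # softmax over the selected experts only

print(len(top8))                           # 8 experts process this token
```

Because only 8 of 256 expert FFNs run per token, the active parameter count (~3B) is a small fraction of the 35B total.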

Run with LocalAI

local-ai run mudler/Carnice-MoE-35B-A3B-APEX-GGUF@Carnice-MoE-35B-A3B-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Author: mudler

Likes: 2

Downloads: 0

Tags: gguf, quantized, apex, moe, mixture-of-experts, qwen3.5, agentic, tool-calling, base_model:samuelcardillo/Carnice-MoE-35B-A3B, base_model:quantized:samuelcardillo/Carnice-MoE-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational