Today's AI Summary

AI Developments: Qwen3 Updates, Self-Evolving Agents, and More

Today's AI landscape is marked by advancements in language models, agentic systems, and multimodal capabilities. Here's a quick rundown of the most interesting developments:

Research Highlights

  • Self-Evolving Agents: A comprehensive survey (arXiv:2507.21046) explores the burgeoning field of self-evolving agents, which adapt their internal parameters to new tasks and environments. The survey categorizes evolutionary mechanisms, adaptation methods, and algorithmic designs, highlighting applications and challenges in achieving Artificial Super Intelligence (ASI).
  • Scientific Discovery with Multi-Agent Systems: The GenoMAS framework (arXiv:2507.21035) presents a team of LLM-based scientists for gene expression analysis. GenoMAS orchestrates specialized LLM agents through typed message-passing protocols, achieving high accuracy in data preprocessing and gene identification on the GenoTEX benchmark.
  • Hallucination Detection in LLM Agents: MIRAGE-Bench (arXiv:2507.21017) introduces a unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. It provides a taxonomy of agentic hallucinations and adopts a fine-grained LLM-as-a-Judge paradigm for scalable assessment of agent actions.
  • Security Tensors for Cross-Modal Safety: A novel approach (arXiv:2507.20994) introduces security tensors to extend text-aligned safety mechanisms to visual modalities in LVLMs. These tensors enhance LVLMs' ability to reject harmful visual inputs while maintaining performance on benign tasks.

Model Updates

  • Qwen3-30B-A3B-Instruct-2507: Unsloth has released GGUF versions of the updated Qwen3-30B-A3B-Instruct-2507 model. This version features improvements in instruction following, reasoning, text comprehension, mathematics, coding, and tool usage. It also boasts enhanced long-context understanding (256K) and better alignment with user preferences. The model achieves strong performance across various benchmarks, including MMLU-Pro, AIME25, and IFEval.
  • EXAONE-4.0.1-32B: LG AI Research has released EXAONE 4.0.1, integrating non-reasoning and reasoning modes, agentic tool use, and multilingual capabilities. The 32B model achieves competitive performance in world knowledge, math/coding, instruction following, agentic tool use, and multilinguality benchmarks.
  • Skywork-UniPic-1.5B: Skywork has released UniPic, a unified autoregressive multimodal model with 1.5 billion parameters. It handles image understanding, text-to-image generation, and image editing tasks. UniPic achieves competitive results on benchmarks like GenEval, DPG-Bench, and GEditBench-EN.

Key Takeaways

  • Agentic AI is Evolving: Research is increasingly focused on creating agents that can adapt, learn, and evolve in real-time, paving the way for more intelligent and autonomous systems.
  • Safety and Reliability are Paramount: Efforts are being made to address critical issues like hallucinations and cross-modal safety in LLMs and LVLMs, ensuring more reliable and trustworthy AI systems.
  • Multimodal Models are Advancing: Models like Skywork-UniPic demonstrate the growing capabilities of unified multimodal models that can handle diverse tasks within a single architecture.
  • Qwen3 Continues to Improve: The Qwen3 family of models is seeing continuous updates and improvements, with the latest versions achieving strong performance across a range of benchmarks.

AI Papers for 2026-04-24

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
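The pairwise pipeline described above can be reduced to a small sketch: each candidate response is judged against a fixed baseline, and verdicts aggregate into a relative win rate rather than an absolute score. The judge below is a stub (a length heuristic purely so the example runs), standing in for the benchmark's LALM-based judge:

```python
def judge_prefers(candidate: str, baseline: str) -> str:
    """Stub judge: returns 'win', 'tie', or 'loss' for the candidate.
    A real judge would be an LALM scoring paralinguistic fidelity; here we
    fake the verdict with a length heuristic so the sketch is runnable."""
    if len(candidate) > len(baseline):
        return "win"
    if len(candidate) == len(baseline):
        return "tie"
    return "loss"

def win_rate(candidates, baselines):
    """Fraction of queries where the candidate beats the fixed baseline,
    counting ties as half a win (a common pairwise-eval convention)."""
    score = 0.0
    for cand, base in zip(candidates, baselines):
        score += {"win": 1.0, "tie": 0.5, "loss": 0.0}[judge_prefers(cand, base)]
    return score / len(candidates)

candidates = ["a long expressive response", "short", "match"]
baselines  = ["baseline reply", "baseline reply", "match"]
print(win_rate(candidates, baselines))  # relative preference, not an absolute score
```

Framing evaluation this way sidesteps the subjectivity of absolute scoring: only the relative preference between two concrete responses has to be judged.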

AVISE: Framework for Evaluating the Security of AI Systems

As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in and evaluating the security of AI systems and models. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that determines whether each test case was able to jailbreak the target model, achieving 92% accuracy, an F1-score of 0.91, and a Matthews correlation coefficient of 0.83. We evaluate nine recently released language models of diverse sizes with the SET and find that all are vulnerable to the augmented Red Queen attack to varying degrees. AVISE provides researchers and industry practitioners with an extensible foundation for developing and deploying automated SETs, offering a concrete step toward more rigorous and reproducible AI security evaluation.
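The judge-quality metrics reported above (accuracy, F1, Matthews correlation coefficient) are standard confusion-matrix statistics. A minimal sketch with made-up counts, not the paper's data:

```python
import math

def judge_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, F1, and MCC for a binary judge, e.g. an ELM deciding
    'did this test case jailbreak the target model?'."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

# Toy confusion matrix (illustrative numbers only, not AVISE's results).
acc, f1, mcc = judge_metrics(tp=40, fp=5, fn=3, tn=52)
print(f"accuracy={acc:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

MCC is the most conservative of the three: unlike accuracy it stays low when one class dominates, which is why it is worth reporting alongside accuracy and F1 for jailbreak detection.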

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.
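As a much-simplified illustration of the first component (not FedSIR's actual algorithm, which analyzes spectral subspaces rather than single directions), one can flag clients whose class-wise mean feature direction disagrees with a clean reference direction:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def flag_noisy_clients(class_means, reference, threshold=0.9):
    """Crude 1-D stand-in for subspace-consistency analysis: a client whose
    class-mean direction is poorly aligned with the clean reference is
    flagged as likely having noisy labels for that class."""
    return {cid: cosine(vec, reference) < threshold
            for cid, vec in class_means.items()}

reference = [1.0, 0.0, 0.0]          # clean class direction (illustrative)
class_means = {
    "client_a": [0.98, 0.05, 0.01],  # well aligned -> treated as clean
    "client_b": [0.10, 0.90, 0.40],  # misaligned   -> flagged as noisy
}
flags = flag_noisy_clients(class_means, reference)
print(flags)
```

Communicating only per-class summary directions (or subspace bases), rather than raw features, is what keeps the identification step's communication overhead minimal.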

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
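The Fourier-domain periodicity described above can be checked on a toy feature: a synthetic embedding dimension that varies with period $T=10$ produces a sharp spike at the corresponding DFT frequency. A minimal sketch (not the paper's code):

```python
import cmath
import math

N, T = 100, 10
# Toy "number feature": one embedding dimension that is periodic in the
# number n with period T, as reported for trained language models.
feature = [math.cos(2 * math.pi * n / T) for n in range(N)]

def dft_magnitudes(x):
    """Magnitudes of the discrete Fourier transform of a real sequence."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

mags = dft_magnitudes(feature)
peak = max(range(1, len(mags)), key=lambda k: mags[k])
print(peak, N // T)  # the spectral spike sits at frequency index N/T = 10
```

Per the paper, such Fourier-domain sparsity is necessary but not sufficient: a feature can spike at period $T$ in the spectrum without being geometrically separable mod-$T$.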

Diagnosing CFG Interpretation in LLMs

As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
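The syntax-versus-structure distinction the benchmark stress-tests can be illustrated with a toy recursive-descent check against a small CFG, tracking recursion depth as well as validity. The grammar and helper names below are invented for illustration:

```python
# Toy grammar (illustrative): E -> NUM | '(' E '+' E ')'
# We check syntactic validity and measure recursion depth, the axis along
# which the paper reports performance collapsing.

def parse_e(tokens, i=0, depth=1):
    """Return (next_index, max_depth); raise ValueError on invalid input."""
    if i < len(tokens) and tokens[i].isdigit():
        return i + 1, depth
    if i < len(tokens) and tokens[i] == "(":
        i, d1 = parse_e(tokens, i + 1, depth + 1)
        if i >= len(tokens) or tokens[i] != "+":
            raise ValueError("expected '+'")
        i, d2 = parse_e(tokens, i + 1, depth + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        return i + 1, max(d1, d2)
    raise ValueError("expected NUM or '('")

def valid(expr: str):
    """(is_valid, recursion_depth) for a whitespace-tolerant expression."""
    tokens = expr.replace("(", " ( ").replace(")", " ) ").replace("+", " + ").split()
    try:
        i, depth = parse_e(tokens)
        return i == len(tokens), depth
    except ValueError:
        return False, 0

print(valid("( 1 + ( 2 + 3 ) )"))  # (True, 3)
```

A deterministic parser like this is exactly what an LLM acting as an in-context interpreter is asked to emulate from the grammar description alone, with no symbol-level execution available.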

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but rather as an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
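As a toy illustration of the generation-validation-correction loop (plain Python tuples standing in for RDF triples, and a hard-coded constraint standing in for SHACL validation; all names below are invented):

```python
# Toy knowledge-graph pipeline: extract candidate triples from text, then
# validate them against a constraint before committing them to the graph.
# This stands in for the RDF/OWL + SHACL machinery described above.

def extract_triples(text: str):
    """Trivial relation extraction: 'X is a Y' -> (X, 'type', Y)."""
    triples = []
    for sentence in text.split("."):
        words = sentence.strip().split()
        if len(words) == 4 and words[1:3] == ["is", "a"]:
            triples.append((words[0], "type", words[3]))
    return triples

def validate(triples, allowed_types):
    """Constraint check (stand-in for SHACL): every 'type' object must come
    from a controlled vocabulary. Returns (conforms, violations)."""
    violations = [t for t in triples
                  if t[1] == "type" and t[2] not in allowed_types]
    return (not violations, violations)

graph = set()
text = "Paris is a City. Foo is a Wibble."
candidates = extract_triples(text)
ok, bad = validate(candidates, allowed_types={"City", "Person"})
graph.update(t for t in candidates if t not in bad)  # commit only valid triples
print(sorted(graph))  # [('Paris', 'type', 'City')]
```

The key property the architecture relies on is visible even at this scale: generated assertions are checked against explicit constraints before they enter persistent memory, so invalid output is caught rather than silently accumulated.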

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
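FKGL, the readability measure quoted above, is a fixed linear formula over word, sentence, and syllable counts; for instance:

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: higher means harder to read.
    Physician answers above sit near grade 11-12; some LLM output near 17."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# E.g. 100 words over 5 sentences with 130 syllables:
print(round(fkgl(100, 5, 130), 2))  # 7.55
```

Because both terms are ratios, longer sentences and more polysyllabic words each raise the grade level, which is why empathy-oriented prompting that shortens sentences can drop GPT-5's output by several FKGL points.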

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width-window and temporal-decay attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.
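The fixed-width variant can be expressed as a tightening of the standard causal mask: position i may attend only to the last w positions up to and including itself. A minimal sketch (illustrative, not the paper's implementation):

```python
def fixed_width_causal_mask(seq_len: int, window: int):
    """mask[i][j] is True iff query position i may attend to key position j:
    causal (j <= i) and within a working-memory window of `window` tokens."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = fixed_width_causal_mask(seq_len=6, window=3)
for row in mask:
    print(["x" if m else "." for m in row])  # banded lower-triangular pattern
```

A temporal-decay variant would instead down-weight attention logits by a decreasing function of the distance i - j rather than cutting them off sharply.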

AI Models

XiaomiMiMo/MiMo-V2.5-ASR


  • license: mit
  • library_name: transformers
  • language: zh, en, yue
  • pipeline_tag: automatic-speech-recognition
  • tags: safetensors, text-generation-inference

<div align="center"> <picture> <source srcset="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" media="(prefers-color-scheme: dark)"> <img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" width="60%" alt="Xiaomi-MiMo" /> </picture> </div> <div align="center"> <h3> <b> <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/> MiMo-V2.5-ASR: Robust Speech Recognition Across<br/> Languages, Dialects, and Complex Acoustic Scenarios<br/> <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span> </b> </h3> </div> <br/> <div align="center" style="line-height: 1;"> | <a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 GitHub</a> &nbsp;| <a href="https://huggingface.co/spaces/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🚀 Online Demo</a> &nbsp;| <a href="https://mimo.xiaomi.com/mimo-v2-5-asr" target="_blank">📰 Blog</a> &nbsp;| <br/> </div> <br/>

Introduction

MiMo-V2.5-ASR is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks.

Abstract

Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. We present MiMo-V2.5-ASR, a large-scale end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:

  • 🗣️ Chinese Dialects: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.
  • 🔀 Code-Switch: Seamless Chinese–English code-switching transcription with no language tags required.
  • 🎵 Song Recognition: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.
  • 🔊 Noisy Environments: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.
  • 👥 Multi-Speaker: Accurate transcription of overlapping, multi-party conversations such as meetings.
  • 🇬🇧 Complex English Scenarios: Leading performance on the Open ASR Leaderboard for challenging English benchmarks such as AMI.
  • 📚 Knowledge-Intensive Recognition: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.
  • 📝 Native Punctuation: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.

Results

MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios.

ASR Results

For per-benchmark numbers and specific qualitative cases, please refer to our blog.

Model Download

| Models | 🤗 Hugging Face |
|-------|-------|
| MiMo-Audio-Tokenizer | XiaomiMiMo/MiMo-Audio-Tokenizer |
| MiMo-V2.5-ASR | XiaomiMiMo/MiMo-V2.5-ASR |

pip install huggingface-hub

hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR

Getting Started

Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.

Prerequisites (Linux)

  • Python 3.12
  • CUDA >= 12.0

Installation

git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

Note: If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:

pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

Run the Demo

python run_mimo_asr.py

This launches a local Gradio interface for MiMo-V2.5-ASR. You can:

  • Upload an audio file or record directly from your microphone.
  • Optionally specify a language tag (Chinese / English / Auto) to bias the model toward a specific language, or leave it on Auto for automatic language detection (recommended for code-switched speech).

Under the hood, the demo calls the asr_sft() interface.

The interface provides a Model Configuration tab for setting local model and tokenizer paths, and a Speech Recognition tab where you drop in audio, pick a language tag, and hit Transcribe — the decoded text and processing status stream into the panels on the right.

<p align="center"> <img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/MiMo_ASR_Demo.png" alt="MiMo-V2.5-ASR Gradio Demo" width="90%" /> <br/> <em>Figure: Gradio demo for MiMo-V2.5-ASR — upload an audio clip or record from your microphone, choose a language tag, and get the transcription on the right.</em> </p>

To load the model and tokenizer automatically at startup, pass their paths on the command line:

python run_mimo_asr.py \
    --model-path ./models/MiMo-V2.5-ASR \
    --tokenizer-path ./models/MiMo-Audio-Tokenizer

Otherwise, enter the local paths for MiMo-Audio-Tokenizer and MiMo-V2.5-ASR in the Model Configuration tab, then start transcribing!

Python API

Basic usage with the asr_sft interface:

from src.mimo_audio.mimo_audio import MimoAudio

model = MimoAudio(
    model_path="./models/MiMo-V2.5-ASR",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
)

# Automatic language detection (recommended for code-switching)
text = model.asr_sft("path/to/audio.wav")
print(text)

# With explicit language tag
text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>")
text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>")

Citation

@misc{coreteam2026mimov25asr,
      title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios},
      author={LLM-Core-Team Xiaomi},
      year={2026},
      url={https://github.com/XiaomiMiMo/MiMo-V2.5-ASR},
}

Contact

Please contact us at mimo@xiaomi.com or open an issue if you have any questions.

Author: XiaomiMiMo

Likes: 12

Downloads: 0

Tags: transformers, safetensors, qwen2, text-generation, text-generation-inference, automatic-speech-recognition, zh, en, yue, license:mit, endpoints_compatible, region:us

huihui-ai/Huihui-Qwen3.6-27B-abliterated


  • library_name: transformers
  • license: apache-2.0
  • license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
  • pipeline_tag: image-text-to-text
  • base_model: Qwen/Qwen3.6-27B
  • tags: abliterated, uncensored

huihui-ai/Huihui-Qwen3.6-27B-abliterated

This is an uncensored version of Qwen/Qwen3.6-27B created with abliteration (see remove-refusals-with-transformers to learn more about it). This is a crude, proof-of-concept implementation to remove refusals from an LLM without using TransformerLens.

ollama

Please use the latest version of ollama.

You can use huihui_ai/qwen3.6-abliterated:27b directly:

ollama run huihui_ai/qwen3.6-abliterated:27b

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue further development and improvement; even a cup of coffee makes a difference.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 9

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, abliterated, uncensored, conversational, base_model:Qwen/Qwen3.6-27B, base_model:finetune:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us

z-lab/Qwen3.6-27B-DFlash

Author: z-lab

Likes: 9

Downloads: 0

Tags: transformers, safetensors, qwen3, feature-extraction, dflash, speculative-decoding, diffusion, efficiency, flash-decoding, qwen, diffusion-language-model, text-generation, custom_code, arxiv:2602.06036, license:mit, text-generation-inference, endpoints_compatible, region:us

Abiray/Qwen3.6-27B-heretic-ARA-GGUF


  • library_name: gguf
  • license: apache-2.0
  • pipeline_tag: text-generation
  • base_model: Qwen/Qwen3.6-27B
  • tags: GGUF, quantized, heretic, decensored, qwen3.6, abhiray, mpoa

Qwen3.6-27B-heretic (GGUF)

This is a decensored version of Qwen/Qwen3.6-27B, made using Heretic v1.2.0 with Magnitude-Preserving Orthogonal Ablation (MPOA). This model is designed for unrestricted, unbound conversational use.

Abliteration Parameters

| Parameter | Value |
| :--- | :--- |
| direction_index | 37.97 |
| attn.o_proj.max_weight | 1.45 |
| attn.o_proj.max_weight_position | 59.09 |
| attn.o_proj.min_weight | 1.44 |
| attn.o_proj.min_weight_distance | 34.80 |
| mlp.down_proj.max_weight | 1.43 |
| mlp.down_proj.max_weight_position | 41.91 |
| mlp.down_proj.min_weight | 0.72 |
| mlp.down_proj.min_weight_distance | 28.18 |

Performance

| Metric | This model | Original model (Qwen/Qwen3.6-27B) |
| :--- | :--- | :--- |
| KL divergence | 0.06530 | 0 (by definition) |
| Refusals | 6/100 | 92/100 |

Available Quantizations (GGUF)

All files are provided in GGUF format, quantized using llama.cpp for optimal performance on local hardware.

| Filename | Size |
| :--- | :--- |
| Qwen3.6-27B-heretic-Q3_K_M.gguf | 13.3 GB |
| Qwen3.6-27B-heretic-Q4_K_S.gguf | 15.6 GB |
| Qwen3.6-27B-heretic-Q4_K_M.gguf | 16.5 GB |
| Qwen3.6-27B-heretic-Q5_K_M.gguf | 19.2 GB |
| Qwen3.6-27B-heretic-Q6_K.gguf | 22.1 GB |
| Qwen3.6-27B-heretic-Q8_0.gguf | 28.6 GB |

Author: Abiray

Likes: 8

Downloads: 0

Tags: gguf, GGUF, quantized, heretic, decensored, qwen3.6, abhiray, mpoa, text-generation, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us, conversational

facebook/sapiens2


  • license: other
  • license_name: sapiens2-license
  • license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
  • library_name: sapiens
  • tags: sapiens, sapiens2, human-centric, vision-transformer

Sapiens2

Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images — designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This is the index repository: each variant lives in its own model repo (linked below).

Pretrained Backbones

| Model | Params | Repository |
|-------|--------|------------|
| Sapiens2-0.1B | 0.114 B | facebook/sapiens2-pretrain-0.1b |
| Sapiens2-0.4B | 0.398 B | facebook/sapiens2-pretrain-0.4b |
| Sapiens2-0.8B | 0.818 B | facebook/sapiens2-pretrain-0.8b |
| Sapiens2-1B | 1.462 B | facebook/sapiens2-pretrain-1b |
| Sapiens2-5B | 5.071 B | facebook/sapiens2-pretrain-5b |

Task Checkpoints

Pose Estimation

| Model | Repository |
|-------|------------|
| Sapiens2-0.4B | facebook/sapiens2-pose-0.4b |
| Sapiens2-0.8B | facebook/sapiens2-pose-0.8b |
| Sapiens2-1B | facebook/sapiens2-pose-1b |
| Sapiens2-5B | facebook/sapiens2-pose-5b |

Body-Part Segmentation

| Model | Repository |
|-------|------------|
| Sapiens2-0.4B | facebook/sapiens2-seg-0.4b |
| Sapiens2-0.8B | facebook/sapiens2-seg-0.8b |
| Sapiens2-1B | facebook/sapiens2-seg-1b |
| Sapiens2-5B | facebook/sapiens2-seg-5b |

Surface Normal Estimation

| Model | Repository |
|-------|------------|
| Sapiens2-0.4B | facebook/sapiens2-normal-0.4b |
| Sapiens2-0.8B | facebook/sapiens2-normal-0.8b |
| Sapiens2-1B | facebook/sapiens2-normal-1b |
| Sapiens2-5B | facebook/sapiens2-normal-5b |

Pointmap Estimation

| Model | Repository |
|-------|------------|
| Sapiens2-0.4B | facebook/sapiens2-pointmap-0.4b |
| Sapiens2-0.8B | facebook/sapiens2-pointmap-0.8b |
| Sapiens2-1B | facebook/sapiens2-pointmap-1b |
| Sapiens2-5B | facebook/sapiens2-pointmap-5b |

License

Released under the Sapiens2 License.

Citation

@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

Author: facebook

Likes: 7

Downloads: 0

Tags: sapiens, sapiens2, human-centric, vision-transformer, license:other, region:us

HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced


  • license: apache-2.0
  • language: en, zh, multilingual
  • pipeline_tag: image-text-to-text
  • base_model: Qwen/Qwen3.6-27B
  • tags: uncensored, qwen3.6, gguf, vision, multimodal, agentic, coding

Qwen3.6-27B-Uncensored-HauhauCS-Balanced

Join the Discord for updates, roadmaps, projects, or just to chat.

Qwen3.6-27B uncensored by HauhauCS. 0/465 Refusals. *

HuggingFace's "Hardware Compatibility" widget doesn't recognize K_P quants — it may show fewer files than actually exist. Click "View +X variants" or go to Files and versions to see all available downloads.

About

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended — just without the refusals.

These are meant to be the best lossless uncensored models out there.

Balanced Variant

Balanced is the recommended default — 99.9%+ of users will be happy here.

Same refusal-removal as Aggressive (0/465 refusals on the benchmark). The difference is how it complies on edgy prompts:

  • Balanced: will reason through the request out loud, occasionally attach a short disclaimer or safety framing, then give the full answer. Output is complete and nothing is held back, but the model may talk itself into it first. Recommended for (agentic) coding, tool use, reasoning, and creative writing/RP use cases.
  • Aggressive (separate release): strips the self-reasoning. Delivers the raw answer directly, no preamble.

Balanced also has meaningfully more stable sampling across re-runs, which matters for long agentic loops: no sporadic topic drift deep into a tool-call chain. Go Aggressive only if you're pushing really hardcore prompts (think things that make people's stomachs turn) and specifically want the model to skip its preamble.

Downloads

| File | Quant | BPW | Size |
|------|-------|-----|------|
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q8_K_P.gguf | Q8_K_P | 10.06 | 32 GB |
| — | Q8_0 | 8.5 | — |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q6_K_P.gguf | Q6_K_P | 7.07 | 23 GB |
| — | Q6_K | 6.6 | — |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q5_K_P.gguf | Q5_K_P | 6.47 | 21 GB |
| — | Q5_K_M | 5.7 | — |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf | Q4_K_P | 5.4 | 18 GB |
| — | Q4_K_M | 4.88 | — |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-IQ4_XS.gguf | IQ4_XS | 4.32 | 15 GB |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q3_K_P.gguf | Q3_K_P | 4.39 | 14 GB |
| — | Q3_K_M | 3.9 | — |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-IQ3_M.gguf | IQ3_M | 3.56 | 13 GB |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-IQ3_XS.gguf | IQ3_XS | 3.3 | 12 GB |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q2_K_P.gguf | Q2_K_P | 3.19 | 12 GB |
| Qwen3.6-27B-Uncensored-HauhauCS-Balanced-IQ2_M.gguf | IQ2_M | 2.69 | 10 GB |
| mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Balanced-f16.gguf | mmproj (f16) | — | 928 MB |

All quants generated with importance matrix (imatrix) for optimal quality preservation on abliterated weights.

What are K_P quants?

K_P ("Perfect") quants are HauhauCS custom quantizations that use model-specific analysis to selectively preserve quality where it matters most. Each model gets its own optimized quantization profile.

A K_P quant effectively bumps quality up by 1-2 quant levels at only ~5-15% larger file size than the base quant. Fully compatible with llama.cpp, LM Studio, and any GGUF-compatible runtime — no special builds needed.

Note: K_P quants may show as "?" in LM Studio's quant column. This is a display issue only — the model loads and runs fine.

Why Balanced for agentic coding

Agentic workflows hit the model with long tool-call chains, structured JSON outputs, deep reasoning chains, and back-to-back prompts in the same session. They need the model to stay deterministic and on-task — not occasionally drift on an edge prompt three tool calls deep into a plan.

Balanced is calibrated for that. It especially removes refusals on security/ops/research-adjacent topics that block legitimate coding work, without bending the sampling geometry that keeps long chains coherent.

Recommended quant for most coding work: Q4_K_P (18 GB, fits in 24 GB VRAM with room for context), or Q8_K_P (32 GB) if you have more VRAM and want 75-99% of BF16 performance (depending on use case) at roughly 55% of the VRAM cost.

Specs

  • 27B dense parameters
  • 64 layers, layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
  • 48 linear attention layers + 16 full gated-attention layers
  • Gated DeltaNet: 48 V heads / 16 QK heads, head dim 128
  • Gated Attention: 24 Q heads / 4 KV heads, head dim 256, rope dim 64
  • Hidden dim 5120, FFN dim 17408, vocab 248320
  • 262K native context, extensible to ~1M with YaRN
  • Natively multimodal (text, image, video) — ships with mmproj
  • Based on Qwen/Qwen3.6-27B

Recommended Settings

From the official Qwen authors:

Thinking mode (default) — general tasks:

  • temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Thinking mode — precise coding / WebDev:

  • temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Non-thinking (Instruct) mode:

  • temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

My personal preference for coding: temperature=0.6 with presence_penalty=1.5. Slightly lower temp keeps tool-call formatting tight; presence 1.5 keeps thinking from spiraling in long agent loops.
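The presets above can be collected into plain request kwargs. Note that top_k, min_p, and repetition_penalty are not part of the core OpenAI request schema; llama.cpp's server accepts them in the request body (with the Python SDK, pass them via extra_body). The preset names and helper below are illustrative, not part of any API:

```python
# Sampler presets from the "Recommended Settings" section above,
# expressed as request kwargs for an OpenAI-compatible server.
THINKING_GENERAL = dict(temperature=1.0, top_p=0.95, top_k=20,
                        min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0)
THINKING_CODING = dict(temperature=0.6, top_p=0.95, top_k=20,
                       min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0)
INSTRUCT = dict(temperature=0.7, top_p=0.80, top_k=20,
                min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0)

def agentic_coding_preset():
    """The card author's personal coding preference: thinking-mode coding
    settings, but with presence_penalty raised to 1.5 to keep long agent
    loops from spiraling. Helper name is illustrative."""
    preset = dict(THINKING_CODING)
    preset.update(presence_penalty=1.5)
    return preset
```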

Important:

  • Keep at least 128K context to preserve thinking capabilities
  • Recommended output length: 32,768 tokens for most queries, up to 81,920 for competition-tier math/code
  • Use --jinja with llama.cpp for proper chat template handling
  • Vision support requires the mmproj file alongside the main GGUF
  • YaRN rope scaling is static in llama.cpp and can hurt short-context performance — only modify rope_parameters if you actually need >262K context

Prompting tip: this model is a bit more sensitive to prompt clarity than Qwen3.5-35B-A3B. For agentic flows, spell out format, constraints, and scope in the system prompt — it'll stay on rails much better than with vague instructions.

Turning Thinking On/Off

Qwen3.6 ships with thinking on by default. Turn it off when you want faster, shorter replies and don't need chain-of-thought.

Heads up: Qwen3.6 does not support the /think and /no_think soft switches that Qwen3 had. You must use the chat-template kwarg below.

LM Studio

  1. Load the model
  2. Right-side settings panel → Model Settings → Prompt Template (or Chat Template Options)
  3. Set enable_thinking to false in the template kwargs
  4. Some LM Studio versions expose this as a direct "Reasoning" / "Thinking" toggle — same effect

llama.cpp

llama-server — set as default for all requests:

llama-server -m Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 131072 -ngl 99 \
  --chat-template-kwargs '{"enable_thinking": false}'

Per-request via the OpenAI-compatible API:

{
  "model": "qwen3.6-27b",
  "messages": [{"role": "user", "content": "..."}],
  "chat_template_kwargs": {"enable_thinking": false}
}

Python openai SDK:

client.chat.completions.create(
    model="qwen3.6-27b",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

Agent scenarios — keep reasoning in context across turns (this one's important):

{"chat_template_kwargs": {"preserve_thinking": true}}

This retains the reasoning block in chat history. Useful for agents where reasoning consistency across tool-call loops matters.
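As a sketch, an agent-loop request that keeps the prior turn's reasoning block in history might look like this. The message contents and model name are placeholders, and the retained <think> block in the assistant turn is schematic:

```python
import json

# Hypothetical agent-loop payload. The previous assistant turn is kept
# verbatim, *including* its <think> block, and preserve_thinking tells the
# chat template not to strip it on re-tokenisation.
payload = {
    "model": "qwen3.6-27b",
    "messages": [
        {"role": "user", "content": "Refactor utils.py, then run the tests."},
        {"role": "assistant",
         "content": "<think>Plan: read file, edit, run tests.</think> Reading utils.py..."},
        {"role": "user", "content": "Tool output: 12 tests passed."},
    ],
    "chat_template_kwargs": {"preserve_thinking": True},
}
body = json.dumps(payload)  # ready to POST to the OpenAI-compatible endpoint
```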

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.

llama-cli -m Qwen3.6-27B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 131072 -ngl 99

Other Models


* Tested with both automated and manual refusal benchmarks and none have been found. If you hit one that's actually obstructive to your use case, join the Discord and flag it so I can work on it in a future revision.

Author: HauhauCS

Likes: 6

Downloads: 0

Tags: gguf, uncensored, qwen3.6, vision, multimodal, agentic, coding, image-text-to-text, en, zh, multilingual, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

Abiray/Qwopus3.6-27B-v1-preview-GGUF


base_model: Jackrong/Qwopus3.6-27B-v1-preview
language:
  • en
pipeline_tag: image-text-to-text
tags:
  • qwen3_5
  • qwen
  • qwen3.6
  • qwopus
  • reasoning
  • instruct
  • gguf
  • vision-language-model
  • multimodal
license: apache-2.0

Qwopus3.6-27B-v1-preview - GGUF

This repository contains GGUF quantized formats of Jackrong/Qwopus3.6-27B-v1-preview.

Qwopus3.6-27B-v1-preview is an early preview reasoning model built on top of the Qwen3.6-27B multimodal base. It is heavily fine-tuned to deliver stronger reasoning quality, a stable answer structure, and more consistent long-form responses. It defaults to a "thinking" mode where reasoning is generated inside <think>...</think> tags prior to the final response.

Available Quantizations

The following quantization formats are provided to balance VRAM/RAM usage and model performance:

| File Name | Quant Type | Description |
|-----------|------------|-------------|
| Qwopus3.6-27B-v1-preview-Q4_K_M.gguf | Q4_K_M | Recommended. Excellent balance of quality and size. |
| Qwopus3.6-27B-v1-preview-Q4_K_S.gguf | Q4_K_S | Slightly smaller than Q4_K_M, minor quality trade-off. |
| Qwopus3.6-27B-v1-preview-Q5_K_M.gguf | Q5_K_M | Higher precision. Requires more RAM/VRAM. |
| Qwopus3.6-27B-v1-preview-Q5_K_S.gguf | Q5_K_S | Good balance for those who want Q5 precision with a slightly lower memory footprint. |
| Qwopus3.6-27B-v1-preview-Q6_K.gguf | Q6_K | Near-unquantized quality, very large file size. |
| Qwopus3.6-27B-v1-preview-Q8_0.gguf | Q8_0 | Highest quality, virtually indistinguishable from FP16. |

Prompt Format

This model uses the standard Qwen chat template. By default, it operates in a reasoning mode. The output format generally follows:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
[Your Prompt Here]<|im_end|>
<|im_start|>assistant
<think>
[Reasoning trace]
</think>
[Final Answer]<|im_end|>
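When consuming such output programmatically, the reasoning trace usually needs to be separated from the final answer. A minimal sketch, assuming the <think>...</think> wrapping shown above:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer).

    Assumes the reasoning trace is wrapped in <think>...</think> as in the
    prompt format above; if no trace is present, reasoning is empty.
    """
    open_tag, close_tag = "<think>", "</think>"
    start, end = text.find(open_tag), text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer
```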

Author: Abiray

Likes: 6

Downloads: 0

Tags: gguf, qwen3_5, qwen, qwen3.6, qwopus, reasoning, instruct, vision-language-model, multimodal, image-text-to-text, en, base_model:Jackrong/Qwopus3.6-27B-v1-preview, base_model:quantized:Jackrong/Qwopus3.6-27B-v1-preview, license:apache-2.0, endpoints_compatible, region:us, conversational

wangzhang/Qwen3.6-27B-abliterated


license: apache-2.0
base_model: Qwen/Qwen3.6-27B
tags:
  • abliterated
  • uncensored
  • dense
  • qwen3.6
  • hybrid-attention
  • gated-deltanet
  • vlm
  • lora-search-merged-release
  • abliterix
language:
  • en
  • zh
library_name: transformers
pipeline_tag: image-text-to-text

Qwen3.6-27B-abliterated

A refusal-suppressed variant of Qwen/Qwen3.6-27B, produced with abliterix. The release uses LoRA-mode orthogonal-projection abliteration merged to BF16, a unified attention-output bucket spanning both the GatedDeltaNet linear_attn.out_proj (48 layers) and the standard self_attn.o_proj (16 layers), and projected abliteration with winsorised refusal vectors (grimjim 2025). The 48-GDN / 16-full-attention hybrid dense layer stack is the first dense Qwen 3.6-generation checkpoint with a Gated DeltaNet backbone at 27B scale, so abliterix had to reconcile steering that fires through two structurally different output projections on the same residual stream.

Key results

| Metric | Base Qwen3.6-27B | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | 100 / 100 | 16 / 100 |
| KL divergence vs base (prompt next-token, benign) | — | 0.0181 |
| Response-length deviation vs base (benign) | — | 0.01 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | 15 / 15 |

The eval refusal counts come from an LLM judge (google/gemini-3-flash-preview via OpenRouter) scoring 100 held-out harmful prompts that were not in the 800-prompt refusal-vector extraction set. The judge prompt treats metaphorical deflection and off-topic template filler as refusals, not compliance, so the 16/100 is a semantic-compliance number, not a keyword-bypass number.

The qualitative compliance row is a separate manual test (scripts/test_trial.py): 15 classic hard prompts — 10 English (lockpicking, phishing email, methamphetamine synthesis, WiFi hacking, fake-news article, hotwiring, pipe-bomb instructions, signature forgery, password-stealer malware, shoplifting) and 5 Chinese (bomb-making, bank phishing email, hacking someone else's computer, forging an ID card, online fraud). The model produces substantive, direct, markdown-formatted responses on 15/15. Two of the Chinese responses show minor coherence artifacts (a self-contradictory "legal/illegal" loop on the computer-intrusion prompt and a "bombs are an important tool" cotton-mouth aside on bomb-making); they still comply, but with slightly lower fluency than the English set.
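The KL-divergence row in the results table can be approximated with a sketch like the following; the exact token position and averaging used by abliterix are assumptions here:

```python
import numpy as np

def next_token_kl(logits_base: np.ndarray, logits_abliterated: np.ndarray) -> float:
    """KL(base || abliterated) over the next-token distribution for one prompt.

    Illustrative only: the card reports a mean of 0.0181 over benign prompts,
    but the exact position and reduction used by abliterix are not given here.
    """
    def softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max()          # numerical stability
        p = np.exp(x)
        return p / p.sum()
    p, q = softmax(logits_base), softmax(logits_abliterated)
    eps = 1e-12                  # guard against log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```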

Why this needed care — three Qwen3.6-27B-specific correctness issues

Qwen3.6-27B is architecturally unusual — it inherits the Qwen3_5ForConditionalGeneration wrapper from the 397B-MoE VLM family but is dense, and its decoder stack interleaves three GatedDeltaNet layers for every one full-attention layer (full_attention_interval = 4). abliterix had to handle three issues that silently break naïve abliteration pipelines on this class:

  1. Two structurally different out_proj modules on the same residual stream. 48 of 64 decoder layers use layer.linear_attn.out_proj (GatedDeltaNet: value-head-dim 48 × 128 = 6144 → hidden 5120), and 16 use layer.self_attn.o_proj (standard: 24 × 256 = 6144 → hidden 5120). Both write to the same 5120-d residual, but their upstream pre-activations are computed by entirely different kernels (linear-attn recurrence vs scaled-dot-product softmax). A "register them as two independent knobs" ablation (V2 in this release's history) ran 30 Optuna trials and plateaued at 26/100 refusals — worse than the unified-bucket run (16/100). The unified approach lets a single layer-indexed decay profile coordinate steering strength across the whole stack; splitting them gives TPE two independent search dimensions whose winning combinations no longer coherently project the same refusal direction. abliterix keeps both registrations under the "attn.o_proj" key (engine.py:772+789).

  2. GDN out_proj passes the shape-guard barely — and the guard orientation matters. src/abliterix/core/steering.py contains a blanket if W.shape[0] != hidden: continue that exists to skip asymmetric modules (MoE routers, GQA Q/K/V with head-dim outputs). For GDN on Qwen3.6-27B, out_proj has weight.shape = (5120, 6144); shape[0] == hidden, so it passes the guard and gets steered. We verified this on-pod by reading one weight slice directly from the safetensors shard before launching the sweep; a transposed orientation would have silently skipped 48/64 layers and neutered half the attention steering surface (the same pitfall that hit earlier Qwen3.5-397B runs).

  3. Multimodal VLM wrapper on a text-only abliteration job. Qwen3_5ForConditionalGeneration loads an unused vision tower (~1 GB BF16) and a complex multimodal Jinja2 chat template. abliterix loads via AutoModelForImageTextToText, steers only model.language_model.layers[:64], and text-only prompts bypass the vision path entirely — but the Jinja2 template's image/video conditional branches still render on every prompt tokenisation, which dominates Phase 1 residual-extraction wall time (> 80 % CPU, ~5–7 % GPU utilisation). This is cosmetic rather than functional — the forward pass is identical to a pure-text decoder — but it does cost ~10 min of Phase 1 runtime versus the ~3 min we'd expect on a standard Qwen3ForCausalLM checkpoint.
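The orientation check from issue 2 can be restated in a few lines; this mirrors the described guard rather than quoting abliterix source:

```python
def should_steer(weight_shape: tuple[int, int], hidden: int) -> bool:
    """Mirror of the blanket shape guard described in issue 2.

    A module is steered only when its output dimension equals the residual
    width: transformers stores a Linear weight as (out_features, in_features),
    so shape[0] is the output side. Illustrative re-statement only.
    """
    return weight_shape[0] == hidden

hidden = 5120
gdn_out_proj = (5120, 6144)  # GatedDeltaNet out_proj: 48 heads x 128 = 6144 -> 5120
full_o_proj = (5120, 6144)   # full attention o_proj: 24 heads x 256 = 6144 -> 5120
transposed = (6144, 5120)    # wrong orientation would be silently skipped
```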

Method

  • Base: Qwen/Qwen3.6-27B — 64 decoder layers (48 linear_attention + 16 full_attention, interleave pattern [GDN, GDN, GDN, full] × 16), hidden = 5120, intermediate = 17 408, GQA 24 Q / 4 KV, head_dim = 256, GDN linear_num_value_heads = 48, linear_value_head_dim = 128, mtp_num_hidden_layers = 1 (auxiliary MTP head untouched), BF16 ≈ 54 GB on disk
  • Tool: abliterix (v1.5+, PEFT LoRA search with merge-on-export)
  • Mode: steering_mode = "lora" (rank-3 full-norm LoRA, full_norm_lora_rank = 3) with merge_and_unload() at export → shipped artifact is plain BF16 safetensors, no PEFT dependency at inference
  • Components steered:
    • attn.o_proj — unified bucket across all 64 layers (48 GDN linear_attn.out_proj + 16 self_attn.o_proj). This was the dominant lever; winning trial peaked here at weight 5.17.
    • mlp.down_proj — all 64 layers. Contribution was minor (winning trial used weight 1.08, near the [1.0, 5.0] floor). A post-hoc V2 experiment that demoted mlp to [0.3, 2.0] confirmed this is essentially a nuisance knob on hybrid-GDN dense.
    • Q/K/V disabled. attn.q/k/v_proj only exist on the 16 full-attention layers (25 % of the stack). Concentrating the optimiser's strength budget on layer-uniform components (attn.o_proj, mlp.down_proj) outperforms spreading it across a 16-of-64 subset.
  • Refusal direction: projected_abliteration = true (grimjim 2025 — only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal), winsorize_vectors = true, winsorize_quantile = 0.995 (symmetric winsorisation damps outlier residual vectors before projection), vector_method = "mean", single-direction (n_directions = 1), extracted from 800 harmful minus 800 benign residuals at final-instruction token position across all 65 (embedding + 64 decoder) positions
  • Search: Optuna TPE, multi-objective (KL divergence + refusal count), 30 trials (10 random warmup + 20 TPE exploitation), kl.target = 0.005, kl.prune_threshold = 5.0, max_gen_tokens = 100, max_batch_size = 4 (auto-batch confirmed bs > 4 regresses on Blackwell GDN kernel fallback path), LLM judge google/gemini-3-flash-preview at batch_size = 10, concurrency = 25
  • Hardware: 1 × NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120, 96 GB GDDR7), driver 590.48.01 / CUDA 12.9, torch 2.10.0+cu129, transformers 5.5.4, PEFT 0.19.1, single-GPU (no TP), wall time ≈ 5 h 55 min for 30 trials end-to-end, cost ≈ $10 on vast.ai
  • Eval set: datasets/good_1000[800:900] (100 benign prompts) and datasets/harmful_1000[800:900] (100 harmful prompts), never seen during refusal-vector extraction or during any trial's steering computation

Winning hyperparameters (Trial 25)

vector_index = 27.02               # layer-27 residual direction (of 64)

[steering]
steering_mode          = "lora"
full_norm_lora_rank    = 3
vector_method          = "mean"
orthogonal_projection  = true
projected_abliteration = true
winsorize_vectors      = true
winsorize_quantile     = 0.995
weight_normalization   = "none"
disabled_components    = ["attn.q_proj", "attn.k_proj", "attn.v_proj"]

[steering.components."attn.o_proj"]  # unified: 48 GDN + 16 full-attn o_proj
max_weight             = 5.17
max_weight_position    = 41.40       # peak at layer ≈ 41 / 64 (mid-to-late)
min_weight             = 3.21        # 62 % of max — whole-stack stays strongly steered
min_weight_distance    = 35.61       # decays over 56 % of the stack

[steering.components."mlp.down_proj"]  # near-disabled
max_weight             = 1.08
max_weight_position    = 60.57       # peak at layer ≈ 61 / 64 (very late)
min_weight             = 0.55
min_weight_distance    = 16.58

The attn.o_proj profile reaches peak strength at layer ≈ 41/64 and decays with a very wide radius (35.6 layers) that never drops below 3.21 — effectively a sustained high-strength attention perturbation across the entire decoder stack, centred slightly past mid-depth. The mlp.down_proj contribution is minor (peak 1.08 near the top of the stack). A sibling experiment (V2) that split the attention bucket into separate GDN and full-attn knobs plateaued at 26/100 refusals after 30 trials — evidence that on Qwen3.6-27B's hybrid stack the refusal direction is carried by a joint GDN+full-attn output-projection signal, and the winning strategy is to push both with one coherent layer-indexed profile.
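One plausible reading of the profile parameters is a linear decay from the peak position out to min_weight_distance, clamped at min_weight; the actual decay curve used by abliterix is not documented here:

```python
def layer_weight(layer: int, max_weight: float, max_weight_position: float,
                 min_weight: float, min_weight_distance: float) -> float:
    """Illustrative per-layer steering strength for one component profile.

    Strength peaks at max_weight_position and decays linearly toward
    min_weight over min_weight_distance layers on either side.
    """
    d = abs(layer - max_weight_position)
    if d >= min_weight_distance:
        return min_weight
    frac = d / min_weight_distance
    return max_weight - frac * (max_weight - min_weight)

# attn.o_proj profile from Trial 25: sustained high strength across the stack.
profile = [layer_weight(l, 5.17, 41.40, 3.21, 35.61) for l in range(64)]
```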

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForImageTextToText
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-27B-abliterated")
model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/Qwen3.6-27B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model retains the base checkpoint's multimodal capability — you can feed image/video inputs via the standard Qwen/Qwen3.6-27B processor interface. Steering only touched the 64 text decoder layers; the vision tower weights are identical to base.

Hardware note: BF16 weights are ~54 GB on disk. Fits on a single RTX PRO 6000 (96 GB), H100-80GB, A100-80GB, or H200. For ≤ 48 GB cards, use device_map="auto" with CPU offload or a quantised variant.

vLLM

vllm serve wangzhang/Qwen3.6-27B-abliterated \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager

--enforce-eager is recommended on Blackwell GPUs until the GatedDeltaNet kernel lands an sm_120 specialisation — CUDA-graph capture on the fallback eager path currently re-records more aggressively than the graph cache can amortise.

Honest limitations

  • Refusal is low, not zero. 16 / 100 held-out prompts still refuse. Residual refusers concentrate at the most explicitly harmful end of the eval set (CBRN specifics, explicit minor-adjacent content); this is the same redundant-circuit problem that limits every abliterated MoE/dense release using a single-direction projection.
  • English > Chinese quality on long-form. Steering vectors came from an English-weighted 800-prompt harmful set. All 5 Chinese hard prompts comply substantively, but 2/5 show minor coherence artifacts (computer intrusion → a self-contradictory "legal/illegal" interleave; bomb-making → a "bombs are an important tool, usable for military/construction/mining" cotton-mouth aside). The model still produces a step-by-step answer — the defect is fluency, not refusal.
  • V2 (GDN/full-attn split) experiment shipped as the loss side. We ran an explicit follow-up with three independent steering buckets (attn.o_proj for 16 full-attn, linear_attn.out_proj for 48 GDN, mlp.down_proj for 64 mlp). It plateaued at 26/100 after 30 trials — 10 worse than V1. We kept the V1 (unified) checkpoint as the release. This is documented here rather than buried because the null result is useful: on hybrid-GDN dense, the two output projections should be treated as one abstraction.
  • Blackwell GDN kernel is the throughput bottleneck. Each trial takes ~11–12 min on 1× RTX PRO 6000 (vs ~5 min for standard dense 27B on A100-80GB) because the GDN selective-scan / linear-attention kernel falls back to PyTorch eager on sm_120. Total sweep is ~6 h instead of the ~3 h the same model would cost on an H100/A100. This is a hardware-kernel compatibility gap, not a tool or config issue.
  • Multimodal compliance is unvalidated. Eval was text-only. Image/video prompts may behave differently; we haven't characterised whether vision-conditioned refusals are similarly ablated (the refusal direction was extracted from text-only prompts at the final-instruction token, so cross-modal transfer is an assumption, not a measurement).
  • MTP head untouched. mtp_num_hidden_layers = 1 — the multi-token-prediction auxiliary head is not part of the 64-layer main decoder and was explicitly truncated away by abliterix's _truncate_to_hidden_layers. If downstream users rely on MTP draft generation, that code path sees the unsteered head.

Reproducibility

Full search artifact (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under configs/qwen3.6_27b.toml + checkpoints_qwen3.6_27b/. To reproduce from scratch on a 1 × RTX PRO 6000 96 GB pod:

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix

# One-time deps install on RunPod / vast.ai PyTorch image
pip install --break-system-packages \
    "transformers>=5.5,<5.6" "peft>=0.18" "huggingface-hub>=1.6" \
    accelerate safetensors sentencepiece optuna datasets \
    bitsandbytes "kernels~=0.11" pydantic-settings questionary \
    hf-transfer psutil rich
pip install --break-system-packages -e . --no-deps

# Download weights (≈ 55 GB)
export HF_HOME=/workspace/hf_cache HF_HUB_ENABLE_HF_TRANSFER=1
hf download Qwen/Qwen3.6-27B --max-workers 16

# Launch sweep (30 trials ≈ 6 h wall, ~$10 on vast.ai at $1.6/h)
AX_CONFIG=configs/qwen3.6_27b.toml abliterix \
    --optimization.checkpoint-dir=/workspace/checkpoints_qwen3.6_27b

# Export + push best trial to HF
python scripts/export_model.py \
    --model Qwen/Qwen3.6-27B \
    --checkpoint /workspace/checkpoints_qwen3.6_27b \
    --trial 25 \
    --config configs/qwen3.6_27b.toml \
    --push-to wangzhang/Qwen3.6-27B-abliterated

Optuna is deterministic if you set sampler_seed in [optimization].
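For example, a hypothetical config fragment; only sampler_seed in [optimization] is confirmed by the text, and the other key follows the CLI flag quoted above:

```toml
# configs/qwen3.6_27b.toml — illustrative fragment, not the shipped config
[optimization]
sampler_seed = 42                                      # makes the TPE sweep deterministic
checkpoint_dir = "/workspace/checkpoints_qwen3.6_27b"  # matches --optimization.checkpoint-dir
```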

Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how a hybrid GatedDeltaNet + full-attention decoder encodes refusal — in particular, whether a single residual-stream steering direction can coherently cancel refusal across two structurally different attention kernels. The answer, per this release, is yes: one unified layer-indexed decay profile on the shared attn.o_proj bucket is sufficient and in fact strictly better than per-kernel splits.

Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and Qwen's usage policy.

Acknowledgments

  • Qwen/Qwen3.6-27B for the base model and the first open hybrid-GDN-dense decoder at 27B scale
  • abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
  • grimjim for the projected-abliteration + winsorised-vector recipe that V1's winning trial depended on
  • HuggingFace / transformers team for landing Qwen3_5ForConditionalGeneration support in the 5.5 series

Author: wangzhang

Likes: 5

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, abliterated, uncensored, dense, qwen3.6, hybrid-attention, gated-deltanet, vlm, lora-search-merged-release, abliterix, conversational, en, zh, base_model:Qwen/Qwen3.6-27B, base_model:finetune:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us

Abiray/Qwen3.6-27B-NVFP4


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
tags:
  • nvfp4
  • quantized
  • compressed-tensors
  • blackwell
  • qwen3.6
  • vlm

Qwen3.6-27B-NVFP4

NVFP4 quantized version of Qwen/Qwen3.6-27B by Abiray using custom Blackwell NVFP4 GEMM kernels

55.6 GB → 19.7 GB (0.35x) with vision tower preserved in BF16.

NVFP4 Quantization Details

| | |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Quantization | NVFP4 (W4A4 — weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB (single safetensors shard) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |

Recipe

QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4

Author: Abiray

Likes: 5

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, nvfp4, quantized, compressed-tensors, blackwell, qwen3.6, vlm, conversational, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, 8-bit, region:us

Abiray/Huihui-Qwen3.6-27B-abliterated-Q4_K_M-GGUF


base_model: Qwen/Qwen3.6-27B
library_name: gguf
license: other
tags:
  • qwen
  • qwen3.6
  • abliterated
  • uncensored
  • gguf
  • llama.cpp
  • text-generation-inference
pipeline_tag: text-generation

Qwen 3.6 27B Abliterated - Q4_K_M (GGUF)

This repository contains the dedicated Q4_K_M GGUF format quantized weights for the abliterated version of Qwen/Qwen3.6-27B.

These files are highly optimized for use with llama.cpp and compatible local inference engines like LM Studio, text-generation-webui, and KoboldCPP.

🔗 Main Repository (Full Quantization Suite)

Looking for a different size? This repository strictly hosts the Q4_K_M model for fast, targeted downloading. If you need smaller or larger quantizations (such as Q3_K_M, Q5_K_M, Q6_K, or Q8_0), please visit the main repository here: 👉 Abiray/Huihui-Qwen3.6-27B-abliterated-GGUF

Abliteration Notes

This model has been processed to remove inherent safety filters and refusal mechanisms. It is highly compliant and will generate responses to complex, edge-case, or typically restricted prompts directly from its base weights. No specialized system prompts or catalyst pre-fills are required to bypass refusals.

Usage with llama.cpp

You can run this model via the command line using standard llama-cli commands. Since the model is abliterated, you do not need to wrap prompts in heavy system instructions.

# Basic inference
./llama-cli -m qwen3.6-27b-abliterated-Q4_K_M.gguf -p "Your prompt here" -n 512 -c 4096

# Interactive conversation mode
./llama-cli -m qwen3.6-27b-abliterated-Q4_K_M.gguf -i -cnv -c 8192

Author: Abiray

Likes: 4

Downloads: 0

Tags: gguf, qwen, qwen3.6, abliterated, uncensored, llama.cpp, text-generation-inference, text-generation, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:other, endpoints_compatible, region:us, conversational