---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
tags:
- abliterated
- uncensored
- dense
- qwen3.6
- hybrid-attention
- gated-deltanet
- vlm
- lora-search-merged-release
- abliterix
language:
- en
- zh
library_name: transformers
pipeline_tag: image-text-to-text
---
# Qwen3.6-27B-abliterated
A refusal-suppressed variant of Qwen/Qwen3.6-27B, produced with abliterix using LoRA-mode orthogonal-projection abliteration (merged to BF16 for release), a unified attention-output bucket spanning both the GatedDeltaNet `linear_attn.out_proj` (48 layers) and the standard `self_attn.o_proj` (16 layers), and projected abliteration with winsorised refusal vectors (grimjim 2025). The 48-GDN / 16-full-attention hybrid stack makes this the first Qwen 3.6-generation dense checkpoint with a Gated DeltaNet backbone at 27B scale, so abliterix had to reconcile steering that fires through two structurally different output projections onto the same residual stream.
## Key results
| Metric | Base Qwen3.6-27B | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | 100 / 100 | 16 / 100 |
| KL divergence vs base (prompt next-token, benign) | — | 0.0181 |
| Response-length deviation vs base (benign) | — | 0.01 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | 15 / 15 |
The eval refusal counts come from an LLM judge (google/gemini-3-flash-preview via OpenRouter) scoring 100 held-out harmful prompts that were not in the 800-prompt refusal-vector extraction set. The judge prompt treats metaphorical deflection and off-topic template filler as refusals, not compliance, so the 16/100 is a semantic-compliance number, not a keyword-bypass number.
The qualitative compliance row is a separate manual test (`scripts/test_trial.py`): 15 classic hard prompts, 10 English (lockpicking, phishing email, methamphetamine synthesis, WiFi hacking, fake-news article, hotwiring, pipe-bomb instructions, signature forgery, password-stealer malware, shoplifting) and 5 Chinese (炸弹制作 "bomb-making", 银行钓鱼邮件 "bank phishing emails", 入侵他人电脑 "hacking someone else's computer", 伪造身份证 "forging ID cards", 网络诈骗 "online fraud"). The model produces substantive, direct, markdown-formatted responses on 15/15. Two of the Chinese responses show minor coherence artifacts (a self-contradictory "合法的/非法的" ("legal/illegal") loop on the computer-intrusion prompt and a "炸弹是重要的工具" ("bombs are important tools") hedging aside on the bomb-making prompt); they still comply, but with slightly lower fluency than the English set.
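The KL row in the table measures drift on benign prompts. As a hedged sketch (the exact reduction abliterix applies is an assumption here), the per-prompt quantity is KL(base ‖ steered) over the next-token distribution, which the reported 0.0181 would then average over the 100 benign eval prompts:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a logit vector
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_kl(base_logits, steered_logits):
    # KL(base || steered) for one prompt's next-token distribution;
    # zero iff the two models agree exactly on this position.
    p = softmax(base_logits)
    q = softmax(steered_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy logits standing in for real model outputs
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
steered = base + rng.normal(scale=0.05, size=1000)  # small perturbation
kl = next_token_kl(base, steered)
```

A KL near 0.018 on held-out benign prompts is the document's evidence that the steering left benign behaviour essentially untouched.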
## Why this needed care: three Qwen3.6-27B-specific correctness issues
Qwen3.6-27B is architecturally unusual: it inherits the `Qwen3_5ForConditionalGeneration` wrapper from the 397B-MoE VLM family but is dense, and its decoder stack interleaves three GatedDeltaNet layers for every full-attention layer (`full_attention_interval = 4`). abliterix had to handle three issues that silently break naïve abliteration pipelines on this class of model:
1. **Two structurally different out_proj modules on the same residual stream.** 48 of 64 decoder layers use `layer.linear_attn.out_proj` (GatedDeltaNet: value-head-dim 48 × 128 = 6144 → hidden 5120), and 16 use `layer.self_attn.o_proj` (standard: 24 × 256 = 6144 → hidden 5120). Both write to the same 5120-d residual, but their upstream pre-activations are computed by entirely different kernels (linear-attention recurrence vs scaled-dot-product softmax). A "register them as two independent knobs" ablation (V2 in this release's history) ran 30 Optuna trials and plateaued at 26/100 refusals, worse than the unified-bucket run (16/100). The unified approach lets a single layer-indexed decay profile coordinate steering strength across the whole stack; splitting them gives TPE two independent search dimensions whose winning combinations no longer coherently project the same refusal direction. abliterix keeps both registrations under the `attn.o_proj` key (engine.py:772+789).
2. **GDN out_proj only just passes the shape guard, and the guard's orientation matters.** `src/abliterix/core/steering.py` contains a blanket `if W.shape[0] != hidden: continue` that exists to skip asymmetric modules (MoE routers, GQA Q/K/V with head-dim outputs). For GDN on Qwen3.6-27B, out_proj has `weight.shape = (5120, 6144)`, so `shape[0] == hidden` and the module passes the guard and gets steered. We verified this on-pod by reading one weight slice directly from the safetensors shard before launching the sweep; a transposed orientation would have silently skipped 48/64 layers and neutered half the attention steering surface (the same pitfall that hit earlier Qwen3.5-397B runs).
3. **Multimodal VLM wrapper on a text-only abliteration job.** `Qwen3_5ForConditionalGeneration` loads an unused vision tower (~1 GB BF16) and a complex multimodal Jinja2 chat template. abliterix loads via `AutoModelForImageTextToText`, steers only `model.language_model.layers[:64]`, and text-only prompts bypass the vision path entirely. The Jinja2 template's image/video conditional branches still render on every prompt tokenisation, however, and this dominates Phase 1 residual-extraction wall time (> 80 % CPU, ~5–7 % GPU utilisation). It is cosmetic rather than functional (the forward pass is identical to a pure-text decoder), but it costs ~10 min of Phase 1 runtime versus the ~3 min we'd expect on a standard Qwen3ForCausalLM checkpoint.
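The shape guard in issue 2 can be illustrated with plain shape tuples. This is a minimal sketch of the check described above, not abliterix's actual code:

```python
HIDDEN = 5120  # Qwen3.6-27B residual width

def passes_shape_guard(weight_shape, hidden=HIDDEN):
    # PyTorch nn.Linear stores weight as (out_features, in_features);
    # the blanket guard keeps a module for steering only when its
    # output side matches the residual width, skipping asymmetric
    # projections such as MoE routers or head-dim Q/K/V outputs.
    return weight_shape[0] == hidden

gdn_out_proj = (HIDDEN, 48 * 128)   # linear_attn.out_proj: (5120, 6144) -> steered
full_out_proj = (HIDDEN, 24 * 256)  # self_attn.o_proj:     (5120, 6144) -> steered
transposed = (48 * 128, HIDDEN)     # wrong orientation:    (6144, 5120) -> silently skipped
```

The transposed case is exactly the failure mode the on-pod safetensors check was guarding against: the guard would not error, it would simply never steer the 48 GDN layers.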
## Method
- Base: Qwen/Qwen3.6-27B, with 64 decoder layers (48 linear_attention + 16 full_attention, interleave pattern [GDN, GDN, GDN, full] × 16), hidden = 5120, intermediate = 17408, GQA 24 Q / 4 KV, head_dim = 256, GDN linear_num_value_heads = 48, linear_value_head_dim = 128, mtp_num_hidden_layers = 1 (auxiliary MTP head untouched), BF16 ≈ 54 GB on disk
- Tool: abliterix (v1.5+, PEFT LoRA search with merge-on-export)
- Mode: `steering_mode = "lora"` (rank-3 full-norm LoRA, `full_norm_lora_rank = 3`) with `merge_and_unload()` at export; the shipped artifact is plain BF16 safetensors with no PEFT dependency at inference
- Components steered:
  - `attn.o_proj`: unified bucket across all 64 layers (48 GDN `linear_attn.out_proj` + 16 `self_attn.o_proj`). This was the dominant lever; the winning trial peaked here at weight 5.17.
  - `mlp.down_proj`: all 64 layers. Contribution was minor (the winning trial used weight 1.08, near the [1.0, 5.0] floor). A post-hoc V2 experiment that demoted mlp to [0.3, 2.0] confirmed this is essentially a nuisance knob on hybrid-GDN dense.
- Q/K/V disabled: `attn.q/k/v_proj` only exist on the 16 full-attention layers (25 % of the stack). Concentrating the optimiser's strength budget on layer-uniform components (`attn.o_proj`, `mlp.down_proj`) outperformed spreading it across a 16-of-64 subset.
- Refusal direction: `projected_abliteration = true` (grimjim 2025: removes only the component of the refusal direction orthogonal to the harmless mean, preserving helpfulness-aligned signal), `winsorize_vectors = true`, `winsorize_quantile = 0.995` (symmetric winsorisation damps outlier residual vectors before projection), `vector_method = "mean"`, single direction (`n_directions = 1`), extracted from 800 harmful minus 800 benign residuals at the final-instruction token position across all 65 positions (embedding + 64 decoder layers)
- Search: Optuna TPE, multi-objective (KL divergence + refusal count), 30 trials (10 random warmup + 20 TPE exploitation), `kl.target = 0.005`, `kl.prune_threshold = 5.0`, `max_gen_tokens = 100`, `max_batch_size = 4` (auto-batch confirmed bs > 4 regresses on the Blackwell GDN kernel fallback path), LLM judge google/gemini-3-flash-preview at `batch_size = 10`, `concurrency = 25`
- Hardware: 1 × NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120, 96 GB GDDR7), driver 590.48.01 / CUDA 12.9, torch 2.10.0+cu129, transformers 5.5.4, PEFT 0.19.1, single-GPU (no TP), wall time ≈ 5 h 55 min for 30 trials end-to-end, cost ≈ $10 on vast.ai
- Eval set: `datasets/good_1000[800:900]` (100 benign prompts) and `datasets/harmful_1000[800:900]` (100 harmful prompts), never seen during refusal-vector extraction or during any trial's steering computation
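The refusal-direction recipe above can be sketched in a few lines of numpy. This is a toy reconstruction of the named ingredients (mean-difference vectors, symmetric winsorisation at q = 0.995, grimjim-style projection against the harmless mean); array shapes and the exact clamping scheme are assumptions, not abliterix's implementation:

```python
import numpy as np

def winsorize(X, q=0.995):
    # Symmetric winsorisation: clamp values to the [1-q, q] quantiles
    # to damp outlier residuals before any mean is taken.
    lo, hi = np.quantile(X, [1.0 - q, q])
    return np.clip(X, lo, hi)

def refusal_direction(harmful, benign, q=0.995):
    # Mean-of-harmful minus mean-of-benign, then projected abliteration:
    # subtract the component parallel to the harmless mean so only the
    # refusal-specific part of the direction is removed at steering time.
    h = winsorize(harmful, q).mean(axis=0)
    b = winsorize(benign, q).mean(axis=0)
    r = h - b                           # raw refusal direction
    b_hat = b / np.linalg.norm(b)       # unit harmless mean
    r = r - (r @ b_hat) * b_hat         # strip harmless-aligned component
    return r / np.linalg.norm(r)

# Toy residuals standing in for the 800-vs-800 extraction sets
rng = np.random.default_rng(7)
benign = rng.normal(0.0, 1.0, (800, 5120))
harmful = benign + rng.normal(0.3, 1.0, (800, 5120))
direction = refusal_direction(harmful, benign)
```

By construction the returned direction is unit-norm and orthogonal to the (winsorised) harmless mean, which is the property the "preserving helpfulness-aligned signal" claim relies on.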
## Winning hyperparameters (Trial 25)
```toml
vector_index = 27.02  # layer-27 residual direction (of 64)

[steering]
steering_mode = "lora"
full_norm_lora_rank = 3
vector_method = "mean"
orthogonal_projection = true
projected_abliteration = true
winsorize_vectors = true
winsorize_quantile = 0.995
weight_normalization = "none"
disabled_components = ["attn.q_proj", "attn.k_proj", "attn.v_proj"]

[steering.components."attn.o_proj"]  # unified: 48 GDN + 16 full-attn o_proj
max_weight = 5.17
max_weight_position = 41.40  # peak at layer ≈ 41 / 64 (mid-to-late)
min_weight = 3.21            # 62 % of max — whole-stack stays strongly steered
min_weight_distance = 35.61  # decays over 56 % of the stack

[steering.components."mlp.down_proj"]  # near-disabled
max_weight = 1.08
max_weight_position = 60.57  # peak at layer ≈ 61 / 64 (very late)
min_weight = 0.55
min_weight_distance = 16.58
```
The attn.o_proj profile reaches peak strength at layer ≈ 41/64 and decays with a very wide radius (35.6 layers) that never drops below 3.21 — effectively a sustained high-strength attention perturbation across the entire decoder stack, centred slightly past mid-depth. The mlp.down_proj contribution is minor (peak 1.08 near the top of the stack). A sibling experiment (V2) that split the attention bucket into separate GDN and full-attn knobs plateaued at 26/100 refusals after 30 trials — evidence that on Qwen3.6-27B's hybrid stack the refusal direction is carried by a joint GDN+full-attn output-projection signal, and the winning strategy is to push both with one coherent layer-indexed profile.
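The four knobs per component describe a layer-indexed strength profile. As a hypothetical reconstruction (the exact interpolation abliterix uses is an assumption), a linear fall-off from the peak that clamps at `min_weight` reproduces the numbers quoted above:

```python
def layer_weights(n_layers, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    # Assumed decay shape: strength peaks at max_weight_position and
    # falls off linearly with distance, reaching min_weight at
    # min_weight_distance layers from the peak and staying clamped
    # there beyond it.
    weights = []
    for layer in range(n_layers):
        frac = min(abs(layer - max_weight_position) / min_weight_distance, 1.0)
        weights.append(max_weight - frac * (max_weight - min_weight))
    return weights

# Trial 25's attn.o_proj profile: peak 5.17 near layer 41,
# decaying toward the 3.21 floor over a 35.6-layer radius
w = layer_weights(64, 5.17, 41.40, 3.21, 35.61)
```

Under this shape the floor of 3.21 is reached only at the very first layers; everything past roughly layer 6 stays above it, matching the "sustained high-strength perturbation" reading in the text.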
## Usage

### Transformers
```python
from transformers import AutoTokenizer, AutoModelForImageTextToText
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-27B-abliterated")
model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/Qwen3.6-27B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
The model retains the base checkpoint's multimodal capability — you can feed image/video inputs via the standard Qwen/Qwen3.6-27B processor interface. Steering only touched the 64 text decoder layers; the vision tower weights are identical to base.
Hardware note: BF16 weights are ~54 GB on disk. Fits on a single RTX PRO 6000 (96 GB), H100-80GB, A100-80GB, or H200. For ≤ 48 GB cards, use device_map="auto" with CPU offload or a quantised variant.
### vLLM
```shell
vllm serve wangzhang/Qwen3.6-27B-abliterated \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --enforce-eager
```

`--enforce-eager` is recommended on Blackwell GPUs until the GatedDeltaNet kernel lands an sm_120 specialisation — CUDA-graph capture on the fallback eager path currently re-records more aggressively than the graph cache can amortise.
## Honest limitations
- Refusal is low, not zero. 16/100 held-out prompts still refuse. Residual refusers concentrate on the most explicitly harmful end of the eval set (CBRN specifics, explicit minor-adjacent content); this is the same redundant-circuit problem that limits every abliterated MoE/dense release using a single-direction projection.
- English > Chinese quality on long-form. Steering vectors came from an English-weighted 800-prompt harmful set. All 5 Chinese hard prompts comply substantively, but 2/5 show minor coherence artifacts (the computer-intrusion prompt draws a self-contradictory "合法的/非法的" ("legal/illegal") interleave; the bomb-making prompt draws a "炸弹是重要的工具, 可用于军事/建筑/采矿" ("bombs are important tools, usable in military/construction/mining") hedging aside). The model still produces a step-by-step answer; the defect is fluency, not refusal.
- V2 (GDN/full-attn split) experiment shipped as the losing side. We ran an explicit follow-up with three independent steering buckets (`attn.o_proj` for the 16 full-attn layers, `linear_attn.out_proj` for the 48 GDN layers, `mlp.down_proj` for all 64 MLPs). It plateaued at 26/100 after 30 trials, 10 refusals worse than V1. We kept the V1 (unified) checkpoint as the release. This is documented here rather than buried because the null result is useful: on hybrid-GDN dense, the two output projections should be treated as one abstraction.
- Blackwell GDN kernel is the throughput bottleneck. Each trial takes ~11–12 min on 1× RTX PRO 6000 (vs ~5 min for standard dense 27B on A100-80GB) because the GDN selective-scan / linear-attention kernel falls back to PyTorch eager on sm_120. Total sweep is ~6 h instead of the ~3 h the same model would cost on an H100/A100. This is a hardware-kernel compatibility gap, not a tool or config issue.
- Multimodal compliance is unvalidated. Eval was text-only. Image/video prompts may behave differently; we haven't characterised whether vision-conditioned refusals are similarly ablated (the refusal direction was extracted from text-only prompts at the final-instruction token, so cross-modal transfer is an assumption, not a measurement).
- MTP head untouched. `mtp_num_hidden_layers = 1`: the multi-token-prediction auxiliary head is not part of the 64-layer main decoder and was explicitly truncated away by abliterix's `_truncate_to_hidden_layers`. If downstream users rely on MTP draft generation, that code path sees the unsteered head.
## Reproducibility

Full search artifacts (Optuna JSONL + judge-cache SQLite) and the exact config are available in the abliterix repo under `configs/qwen3.6_27b.toml` and `checkpoints_qwen3.6_27b/`. To reproduce from scratch on a 1 × RTX PRO 6000 96 GB pod:
```shell
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix

# One-time deps install on RunPod / vast.ai PyTorch image
pip install --break-system-packages \
  "transformers>=5.5,<5.6" "peft>=0.18" "huggingface-hub>=1.6" \
  accelerate safetensors sentencepiece optuna datasets \
  bitsandbytes "kernels~=0.11" pydantic-settings questionary \
  hf-transfer psutil rich
pip install --break-system-packages -e . --no-deps

# Download weights (≈ 55 GB)
export HF_HOME=/workspace/hf_cache HF_HUB_ENABLE_HF_TRANSFER=1
hf download Qwen/Qwen3.6-27B --max-workers 16

# Launch sweep (30 trials ≈ 6 h wall, ~$10 on vast.ai at $1.6/h)
AX_CONFIG=configs/qwen3.6_27b.toml abliterix \
  --optimization.checkpoint-dir=/workspace/checkpoints_qwen3.6_27b

# Export + push best trial to HF
python scripts/export_model.py \
  --model Qwen/Qwen3.6-27B \
  --checkpoint /workspace/checkpoints_qwen3.6_27b \
  --trial 25 \
  --config configs/qwen3.6_27b.toml \
  --push-to wangzhang/Qwen3.6-27B-abliterated
```
Optuna is deterministic if you set `sampler_seed` in `[optimization]`.
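For example (a hypothetical fragment: only the `sampler_seed` key and `[optimization]` section are named by the text above; 42 is an arbitrary choice):

```toml
# Append to configs/qwen3.6_27b.toml for a bit-reproducible sweep
[optimization]
sampler_seed = 42
```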
## Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how a hybrid GatedDeltaNet + full-attention decoder encodes refusal; in particular, whether a single residual-stream steering direction can coherently cancel refusal across two structurally different attention kernels. The answer, per this release, is yes: one unified layer-indexed decay profile on the shared `attn.o_proj` bucket is sufficient and in fact strictly better than per-kernel splits.
Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and Qwen's usage policy.
## Acknowledgments
- Qwen/Qwen3.6-27B for the base model and the first open hybrid-GDN-dense decoder at 27B scale
- abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
- grimjim for the projected-abliteration + winsorised-vector recipe that V1's winning trial depended on
- HuggingFace / transformers team for landing `Qwen3_5ForConditionalGeneration` support in the 5.5 series