samuelcardillo/Qwopus-MoE-35B-A3B-GGUF

---
language:
- en
- zh
license: apache-2.0
tags:
- qwen3.5
- moe
- reasoning
- distillation
- claude-opus
- qlora
- unsloth
base_model: Qwen/Qwen3.5-35B-A3B
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- Jackrong/Qwen3.5-reasoning-700x
- Roman1111111/claude-opus-4.6-10000x
---
# Qwopus MoE 35B-A3B — Claude Opus 4.6 Reasoning Distilled (GGUF)
QLoRA fine-tune of Qwen3.5-35B-A3B (MoE, 3B active parameters) with Claude Opus 4.6 reasoning distillation. Training recipe adapted from Jackrong's Qwopus3.5-27B-v3 — same datasets and methodology, applied to the MoE architecture.
## Credits
This model is heavily inspired by and based on the work of Jackrong and his Qwopus3.5-27B-v3 training methodology. The datasets, training philosophy ("act-then-refine" paradigm), and structural reasoning approach are all derived from his research. Please check his complete training guide for the full methodology.
The key difference: we adapted his recipe from the 27B dense model to the 35B-A3B MoE architecture.
## Available Quantizations

| Quantization | Size | BPW | Min VRAM |
|---|---|---|---|
| Q8_0 | 35 GB | 8.52 | 1x 48GB GPU |
| Q6_K | 27 GB | 6.58 | 1x 32GB GPU |
| Q5_K_M | 24 GB | 5.70 | 1x 32GB GPU |
| Q4_K_M | 20 GB | 4.87 | 1x 24GB GPU |
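For readers sanity-checking the table, bits-per-weight is simply file size in bits divided by parameter count. A rough sketch, assuming ~35e9 total parameters and GiB-based sizes (both approximations — GGUF files also carry metadata and a few higher-precision tensors, so it only lands close to the listed BPW figures):

```python
# Rough BPW estimate: file_size_bits / n_params.
# Assumes ~35e9 params and GiB sizes; real figures differ slightly.
N_PARAMS = 35e9

def bpw(size_gib: float) -> float:
    return size_gib * 2**30 * 8 / N_PARAMS

for name, size in [("Q8_0", 35), ("Q6_K", 27), ("Q5_K_M", 24), ("Q4_K_M", 20)]:
    print(f"{name}: ~{bpw(size):.2f} bpw")
```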
## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Max Context | 131,072 tokens (128K) |
## Benchmark Results

**Qwopus MoE (Jackrong recipe) vs Opus Distilled v2 (previous QLoRA)**
Benchmarked across 8 diverse tasks: coding, bug detection, reasoning, instruction following, research, and agentic planning.
| Test | Qwopus MoE | Opus Distilled v2 | Winner |
|---|---|---|---|
| Coding: LRU Cache | 6.9KB content | 4.8KB content | Qwopus |
| Coding: Async Scraper | 8.5KB content | 7.6KB content | Qwopus |
| Bug Detection | 2.5KB + 2.1KB thinking | 2.4KB + 2.9KB thinking | Tie |
| Reasoning: Probability | 0 chars (stuck thinking) | 1.3KB content | v2 |
| Reasoning: Logic | 747 chars | 949 chars | v2 |
| JSON Output | 319 chars, 6.8s | 325 chars, 1.4s | v2 (5x faster) |
| Research: Architecture Analysis | 4.5KB content | 696 chars (overthinks) | Qwopus |
| Agentic: CI/CD Planning | 6.9KB content | 5.8KB content | Qwopus |
### Speed

| Model | tok/s |
|---|---|
| Qwopus MoE | 175 |
| Opus Distilled v2 | 204 |
### Verdict

Qwopus MoE produces more useful visible output, with a better content-to-thinking ratio. It excels at tasks requiring detailed, user-facing responses (coding, research, planning). Opus Distilled v2 is about 16% faster, but its aggressive thinking mode sometimes produces minimal visible content.
**Best for:** Coding assistants, research agents, content generation, and agentic workflows where output quality matters more than raw speed.
## Training Details

### Recipe (adapted from Jackrong's Qwopus3.5-27B-v3)

| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit base + LoRA adapters in BF16) |
| Framework | Unsloth 2026.4.2 + TRL |
| Base Model | unsloth/Qwen3.5-35B-A3B |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Targets | q_proj, k_proj, v_proj, o_proj (attention only) |
| Trainable Parameters | 6,881,280 (0.02% of 35B) |
| Learning Rate | 2e-5 (linear schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.001 |
| Optimizer | adamw_8bit |
| Epochs | 2 |
| Effective Batch Size | 12 (1 x 12 grad accum) |
| Max Sequence Length | 4096 |
| Total Steps | 536 |
| Final Loss | 0.5517 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
| Training Time | ~3.5 hours |
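The step count above follows directly from the dataset size and batch configuration: 3,209 examples at an effective batch of 12 gives ceil(3209 / 12) = 268 optimizer steps per epoch, and 2 epochs gives 536 total. A quick check:

```python
import math

examples, eff_batch, epochs = 3209, 12, 2
steps_per_epoch = math.ceil(examples / eff_batch)  # last partial batch counts
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 268 536 — matches the recipe table
```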
### Differences from Jackrong's 27B recipe

| Aspect | Jackrong (27B dense) | Ours (35B-A3B MoE) |
|---|---|---|
| Base model | Qwen3.5-27B (dense) | Qwen3.5-35B-A3B (MoE) |
| LoRA rank | 64 | 32 (GPU memory constraint) |
| LoRA targets | q, k, v, o, gate, up, down | q, k, v, o only (MoE experts too large) |
| Trainable params | ~0.5% | 0.02% |
| Batch size | ~36 | 12 |
| Context length | 8192 | 4096 (GPU memory constraint) |
### Datasets (3,209 examples after quality filtering)

| Dataset | Examples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 | Claude Opus 4.6 reasoning traces |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen reasoning conversations |
| Roman1111111/claude-opus-4.6-10000x | ~250 (after filtering) | Claude Opus 4.6 conversations |
Quality filter: required assistant content >100 characters.
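The filter is trivial to reproduce. A sketch assuming examples in the common chat-messages format — the field names and the per-turn interpretation of ">100 characters" are assumptions, not the exact preprocessing code used here:

```python
MIN_CHARS = 100  # threshold from the card: assistant content must exceed 100 chars

def passes_filter(example: dict) -> bool:
    """Keep an example only if every assistant turn exceeds MIN_CHARS characters."""
    assistant_turns = [m for m in example["messages"] if m["role"] == "assistant"]
    return bool(assistant_turns) and all(
        len(m["content"]) > MIN_CHARS for m in assistant_turns
    )

examples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "ok"}]},       # too short: dropped
    {"messages": [{"role": "user", "content": "explain MoE"},
                  {"role": "assistant", "content": "x" * 500}]},  # kept
]
kept = [e for e in examples if passes_filter(e)]
print(len(kept))  # 1
```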
## Usage with llama.cpp

```shell
llama-server \
  --model Qwopus-MoE-35B-A3B-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 131072 \
  --host 0.0.0.0 --port 8082
```
The model uses `<think>...</think>` reasoning tags natively (inherited from the Qwen3.5 base).
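Because the reasoning is emitted inline, clients that only want the final answer can split it out. A minimal sketch — the regex approach is one way to post-process the raw response text, not a llama.cpp feature:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (thinking, visible_content) from a raw model response."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    visible = THINK_RE.sub("", raw).strip()
    return thinking, visible

raw = "<think>Check edge cases first.</think>The function handles n=0 correctly."
thinking, visible = split_reasoning(raw)
print(visible)  # The function handles n=0 correctly.
```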
## Acknowledgements
Author: samuelcardillo





