Today's AI Summary

AI Developments: Reranking Models, Multi-Agent Systems, and Model Science

This article summarizes recent developments in AI, focusing on new models and research papers that introduce innovative approaches and achieve notable results.

Research Papers

Several interesting research papers have emerged, addressing diverse challenges in AI:

  • CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning introduces a novel framework that integrates a generalist planner with a specialist executor, trained via a dedicated two-stage pipeline. CODA significantly outperforms baselines on the ScienceBoard benchmark.
  • Discrete-Guided Diffusion for Scalable and Safe Multi-Robot Motion Planning presents a framework that integrates discrete MAPF solvers with constrained generative diffusion models, achieving state-of-the-art performance in large-scale, complex environments, scaling to 100 robots.
  • Model Science: getting serious about verification, explanation and control of AI systems introduces a conceptual framework for a new discipline called Model Science, along with the proposal for its four key pillars: Verification, Explanation, Control, and Interface.
  • DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis introduces a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis.
  • Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence introduces a decentralized multi-agent system which enables lightweight LLMs on consumer-grade GPUs to coordinate. Symphony outperforms existing baselines on reasoning benchmarks.
  • SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control introduces a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL demonstrates superior performance on both high-level and low-level GUI benchmarks.
  • Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment develops a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning, using both distributional change-detection methods and order parameters that are formulated in plain English and evaluated by an LLM judge.

Models

Several new models have been released, showcasing advancements in various AI tasks:

  • sigridjineth/ctxl-rerank-v2-1b-seq-cls is a converted SequenceClassification model based on ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b. It exposes a single logit per input, enabling fast, simple reranking with standard text-classification tooling while preserving the original scores and ranking order (see the sketch after this list).
  • adeelahmad/ReasonableQwen3-4B is based on Qwen3 and supports seamless switching between thinking mode and non-thinking mode within a single model.
  • weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf offers GGUF quantizations of NVIDIA's NVIDIA-Nemotron-Nano-9B-v2, targeting llama.cpp-compatible runtimes.
  • fixie-ai/ultraVAD is a context-aware, audio-native endpointing model that estimates the probability that a speaker has finished their turn in real time by fusing recent dialog text with the user’s audio. It achieves high accuracy, precision, recall, F1 score, and AUC on context-dependent turn-taking.
  • prithivMLmods/DeepCaption-VLA-7B is a fine-tuned version of Qwen2.5-VL-7B-Instruct, tailored for Image Captioning and Vision Language Attribution.
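
Below is a minimal sketch of the single-logit reranking pattern used by the converted reranker above, driven through standard text-classification tooling. The query/document pairing format is an assumption; check the model card for the exact template the checkpoint expects.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "sigridjineth/ctxl-rerank-v2-1b-seq-cls"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "how do I run GGUF models on a laptop?"
docs = ["Use llama.cpp with a quantized model.",
        "Bananas are rich in potassium.",
        "LM Studio can load GGUF files directly."]

# One logit per (query, document) pair; higher means more relevant.
inputs = tokenizer([query] * len(docs), docs, padding=True,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)

for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:+.3f}  {doc}")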

Key Takeaways

  • Compositional frameworks are gaining traction, combining generalist and specialist agents for improved performance in complex tasks.
  • Decentralized multi-agent systems offer scalability and adaptability advantages over centralized approaches.
  • Model Science is emerging as a new discipline focused on understanding, verifying, and controlling AI model behavior.
  • Reranking models are becoming more efficient and easier to integrate into existing workflows.
  • Quantization techniques continue to improve, enabling deployment of large models on resource-constrained devices.

AI Papers for 2026-04-07

Enhancing Robustness of Federated Learning via Server Learning

This paper explores the use of server learning to enhance the robustness of federated learning against malicious attacks, even when clients' training data are not independent and identically distributed. We propose a heuristic algorithm that combines server learning and client-update filtering with geometric median aggregation. We demonstrate via experiments that this approach can significantly improve model accuracy even when the fraction of malicious clients is high (more than 50% in some cases) and the dataset utilized by the server is small and possibly synthetic, with a distribution not necessarily close to that of the clients' aggregated data.
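
Geometric median aggregation is the robustness workhorse in the approach above. A minimal NumPy sketch of Weiszfeld's algorithm (an illustration, not the authors' code) shows why it resists outlier updates where the plain mean does not:

import numpy as np

def geometric_median(updates, n_iters=100, eps=1e-8):
    """Weiszfeld's algorithm: the point minimizing the sum of
    Euclidean distances to the client updates (rows of `updates`)."""
    median = updates.mean(axis=0)  # start from the coordinate-wise mean
    for _ in range(n_iters):
        dists = np.linalg.norm(updates - median, axis=1)
        dists = np.maximum(dists, eps)        # avoid division by zero
        weights = 1.0 / dists
        new_median = (weights[:, None] * updates).sum(0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

# Three honest updates plus one large malicious one: the mean is
# dragged far off, while the geometric median barely moves.
updates = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [100.0, -100.0]])
print(updates.mean(axis=0))        # ~[25.75, -24.25]
print(geometric_median(updates))   # ~[1.0, 1.0]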

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as little as two lines of code.

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation. Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight. This article argues that squirrel ecology offers a sharp comparative case because arboreal locomotion, scatter-hoarding, and audience-sensitive caching couple all three demands in one organism. We synthesize evidence from fox, eastern gray, and, in one field comparison, red squirrels, and impose an explicit inference ladder: empirical observation, minimal computational inference, and AI design conjecture. We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals. This motivates three hypotheses: (H1) fast local feedback plus predictive compensation improves robustness under hidden dynamics shifts; (H2) memory organized for future control improves delayed retrieval under cue conflict and load; and (H3) verifiers and observer models inside the action-memory loop reduce silent failure and information leakage while remaining vulnerable to misspecification. A downstream conjecture is that role-differentiated proposer/executor/checker/adversary systems may reduce correlated error under asymmetric information and verification burden. The contribution is a comparative perspective and benchmark agenda: a disciplined program of falsifiable claims about the coupling of control, memory, and verifiable action.

Reliability-Gated Multi-Teacher Distillation for Low-Resource Abstractive Summarization

We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student's position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122% of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
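
A schematic reading of EWAD's token-level routing (an illustration, not the authors' code): gate toward the pooled teacher distribution when teachers agree, and toward the gold label when they do not. The entropy-based gate below is a simplification of the paper's agreement measure.

import torch
import torch.nn.functional as F

def agreement_gated_loss(student_logits, teacher_logits_list, gold_ids):
    """Token-level gate between KD and gold supervision.
    Shapes: student_logits (T, V); each teacher (T, V); gold_ids (T,)."""
    teacher_probs = torch.stack(
        [F.softmax(t, dim=-1) for t in teacher_logits_list]).mean(0)
    # Low entropy of the pooled teacher distribution => high agreement.
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-9).log()).sum(-1)
    gate = torch.exp(-entropy)                      # per token, in (0, 1]

    log_student = F.log_softmax(student_logits, dim=-1)
    kd = F.kl_div(log_student, teacher_probs, reduction="none").sum(-1)
    ce = F.cross_entropy(student_logits, gold_ids, reduction="none")
    return (gate * kd + (1.0 - gate) * ce).mean()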

Gradient Boosting within a Single Attention Layer

Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of 67.9 compared to 72.2 for standard attention, 69.6 for Twicing Attention, and 69.0 for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
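
A compact PyTorch sketch of the mechanism as the abstract describes it: a second attention pass with its own projections estimates the first pass's prediction error and applies a per-dimension gated correction. The head count, gate parameterization, and exact query/key/value wiring here are assumptions.

import torch
import torch.nn as nn

class GradientBoostedAttention(nn.Module):
    def __init__(self, d_model):  # d_model divisible by the head count
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # Separate learned projections for the correction pass.
        self.attn2 = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(d_model))  # per-dim shrinkage

    def forward(self, x):
        y1, _ = self.attn1(x, x, x)        # first pass: base learner
        residual = x - y1                  # prediction error of pass 1
        y2, _ = self.attn2(x, x, residual) # second pass estimates the error
        return y1 + torch.sigmoid(self.gate) * y2  # gated correction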

Reflective Context Learning: Studying the Optimization Primitives of Context Space

Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high-variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context-optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit-assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer-state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.
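
The reflect-then-mutate loop at the core of RCL can be summarized schematically. Everything below is a runnable toy (function names and the stub "agent" are illustrative, not the paper's code): reflection turns trajectories plus the current context into a gradient-like update signal, and mutation applies it in context space.

from itertools import islice

def chunked(seq, n):  # batching primitive over tasks
    it = iter(seq)
    while batch := list(islice(it, n)):
        yield batch

def reflect(trajectories, context):
    """Toy 'reflection': collect failure notes. In RCL this would be an
    LLM reading rollouts + context and emitting a directional signal."""
    return [t["note"] for t in trajectories if not t["success"]]

def mutate(context, signal):
    """Toy 'mutation': fold the update signal into the context."""
    return context + [f"Avoid: {note}" for note in signal]

def run_task(task, context):
    # Stand-in for a rollout; succeeds once the lesson is in context.
    return {"success": f"Avoid: {task}" in context, "note": task}

context, tasks = [], ["clicking stale links", "ignoring pagination"]
for _ in range(2):  # two "epochs" of interact -> reflect -> mutate
    for batch in chunked(tasks, 2):
        trajectories = [run_task(t, context) for t in batch]
        context = mutate(context, reflect(trajectories, context))
print(context)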

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.
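
The corruption idea is simple to state in code. A toy sketch (my illustration, not the paper's implementation) of a hallucination-inductive visual corruption that keeps the question intact while destroying the evidence needed to answer it:

import numpy as np

def corrupt_visual(image, mode="remove"):
    """Remove or replace the visual evidence while leaving the textual
    question untouched, so a correct answer can only be hallucinated."""
    if mode == "remove":
        return np.zeros_like(image)  # blank frame
    rng = np.random.default_rng(0)
    return rng.integers(0, 256, image.shape, dtype=image.dtype)  # noise

sample = {"image": np.full((224, 224, 3), 127, dtype=np.uint8),
          "question": "What color is the bus?", "answer": "red"}
sample["image"] = corrupt_visual(sample["image"], mode="remove")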

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

The recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation includes a comprehensive framework integrating Reinforcement Learning (RL) from Policy Optimization techniques along with adaptive reward functions, that demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrated Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) in the RL framework that only requires single GPU configurations while preserving performance integrity. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models utilizing the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite utilizing half the parameter count, while simultaneously reducing inference latency from 31 seconds to 9 seconds.
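
The LoRA side of the recipe is standard and easy to sketch with the peft library; the rank, target modules, and base model below are illustrative assumptions, not the paper's exact configuration (the paper fine-tunes Qwen3-VL-4B-Instruct, whose loading class may differ).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; swap in the VLM class for vision inputs.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only adapter weights train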

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model's self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.
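
A condensed sketch of the recovery pipeline described above: PCA over emotion steering vectors, then ridge regression from the top components to self-reported valence-arousal scores. The data here is synthetic; the real pipeline operates on model-derived steering vectors and scores.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Stand-ins: one steering vector per emotion (hidden size 64 here),
# with self-reported (valence, arousal) scores for each.
steering_vectors = rng.normal(size=(32, 64))
va_scores = rng.normal(size=(32, 2))   # columns: valence, arousal

# VA axes = linear combinations of the top PCA components,
# fit by ridge regression on the self-reported scores.
pca = PCA(n_components=8).fit(steering_vectors)
components = pca.transform(steering_vectors)      # (32, 8)
ridge = Ridge(alpha=1.0).fit(components, va_scores)

# Map the learned axes back into the model's hidden space:
va_axes = ridge.coef_ @ pca.components_           # (2, 64)
valence_axis, arousal_axis = va_axes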

AI Models

samuelcardillo/Qwopus-MoE-35B-A3B-GGUF


language: en, zh
license: apache-2.0
tags: qwen3.5, moe, reasoning, distillation, claude-opus, qlora, unsloth
base_model: Qwen/Qwen3.5-35B-A3B
datasets:

  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Jackrong/Qwen3.5-reasoning-700x
  • Roman1111111/claude-opus-4.6-10000x

Qwopus MoE 35B-A3B — Claude Opus 4.6 Reasoning Distilled (GGUF)

QLoRA fine-tune of Qwen3.5-35B-A3B (MoE, 3B active parameters) with Claude Opus 4.6 reasoning distillation. Training recipe adapted from Jackrong's Qwopus3.5-27B-v3 — same datasets and methodology, applied to the MoE architecture.

Credits

This model is heavily inspired by and based on the work of Jackrong and his Qwopus3.5-27B-v3 training methodology. The datasets, training philosophy ("act-then-refine" paradigm), and structural reasoning approach are all derived from his research. Please check his complete training guide for the full methodology.

The key difference: we adapted his recipe from the 27B dense model to the 35B-A3B MoE architecture.

Available Quantizations

| Quantization | Size | BPW | Min VRAM |
|---|---|---|---|
| Q8_0 | 35 GB | 8.52 | 1x 48GB GPU |
| Q6_K | 27 GB | 6.58 | 1x 32GB GPU |
| Q5_K_M | 24 GB | 5.70 | 1x 32GB GPU |
| Q4_K_M | 20 GB | 4.87 | 1x 24GB GPU |

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Max Context | 131,072 tokens (128K) |

Benchmark Results

Qwopus MoE (Jackrong recipe) vs Opus Distilled v2 (previous QLoRA)

Benchmarked across 8 diverse tasks: coding, bug detection, reasoning, instruction following, research, and agentic planning.

| Test | Qwopus MoE | Opus Distilled v2 | Winner |
|---|---|---|---|
| Coding: LRU Cache | 6.9KB content | 4.8KB content | Qwopus |
| Coding: Async Scraper | 8.5KB content | 7.6KB content | Qwopus |
| Bug Detection | 2.5KB + 2.1KB thinking | 2.4KB + 2.9KB thinking | Tie |
| Reasoning: Probability | 0 chars (stuck thinking) | 1.3KB content | v2 |
| Reasoning: Logic | 747 chars | 949 chars | v2 |
| JSON Output | 319 chars, 6.8s | 325 chars, 1.4s | v2 (5x faster) |
| Research: Architecture Analysis | 4.5KB content | 696 chars (overthinks) | Qwopus |
| Agentic: CI/CD Planning | 6.9KB content | 5.8KB content | Qwopus |

Speed

| Model | tok/s |
|---|---|
| Qwopus MoE | 175 |
| Opus Distilled v2 | 204 |

Verdict

Qwopus MoE produces more useful visible output — better content/thinking ratio. It excels at tasks requiring detailed, user-facing responses (coding, research, planning). The Opus Distilled v2 is 16% faster but has an aggressive thinking mode that sometimes produces minimal visible content.

Best for: Coding assistants, research agents, content generation, agentic workflows where output quality matters more than raw speed.

Training Details

Recipe (adapted from Jackrong's Qwopus3.5-27B-v3)

| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit base + LoRA adapters in BF16) |
| Framework | Unsloth 2026.4.2 + TRL |
| Base Model | unsloth/Qwen3.5-35B-A3B |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Targets | q_proj, k_proj, v_proj, o_proj (attention only) |
| Trainable Parameters | 6,881,280 (0.02% of 35B) |
| Learning Rate | 2e-5 (linear schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.001 |
| Optimizer | adamw_8bit |
| Epochs | 2 |
| Effective Batch Size | 12 (1 x 12 grad accum) |
| Max Sequence Length | 4096 |
| Total Steps | 536 |
| Final Loss | 0.5517 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
| Training Time | ~3.5 hours |

Differences from Jackrong's 27B recipe

| Aspect | Jackrong (27B dense) | Ours (35B-A3B MoE) |
|---|---|---|
| Base model | Qwen3.5-27B (dense) | Qwen3.5-35B-A3B (MoE) |
| LoRA rank | 64 | 32 (GPU memory constraint) |
| LoRA targets | q, k, v, o, gate, up, down | q, k, v, o only (MoE experts too large) |
| Trainable params | ~0.5% | 0.02% |
| Batch size | ~36 | 12 |
| Context length | 8192 | 4096 (GPU memory constraint) |

Datasets (3,209 examples after quality filtering)

| Dataset | Examples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 | Claude Opus 4.6 reasoning traces |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen reasoning conversations |
| Roman1111111/claude-opus-4.6-10000x | ~250 (after filtering) | Claude Opus 4.6 conversations |

Quality filter: required assistant content >100 characters.
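
The filter itself is a one-line predicate; a sketch with the datasets library (the column name is an assumption about the dataset schema):

from datasets import load_dataset

ds = load_dataset("nohurry/Opus-4.6-Reasoning-3000x-filtered", split="train")
# Keep examples whose assistant reply exceeds 100 characters; the
# "assistant" column name is an assumption, not the verified schema.
ds = ds.filter(lambda ex: len(ex["assistant"]) > 100)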

Usage with llama.cpp

llama-server \
  --model Qwopus-MoE-35B-A3B-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 131072 \
  --host 0.0.0.0 --port 8082

The model uses <think>...</think> reasoning tags natively (inherited from Qwen3.5 base).
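
With the server above running, any OpenAI-compatible client can drive it. A minimal sketch that also strips the native <think> block from replies:

import re
from openai import OpenAI

# Points at the llama-server started above; the API key is a dummy.
client = OpenAI(base_url="http://localhost:8082/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="qwopus-moe",  # the model name is cosmetic for llama-server
    messages=[{"role": "user", "content": "Implement an LRU cache in Python."}],
)
text = resp.choices[0].message.content
# Drop the <think>...</think> reasoning block inherited from Qwen3.5.
print(re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip())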

Acknowledgements

  • Jackrong — Training methodology, datasets, and the Qwopus concept
  • Unsloth — Efficient QLoRA training framework
  • Qwen — Base model architecture

Author: samuelcardillo

Likes: 13

Downloads: 0

Tags: gguf, qwen3.5, moe, reasoning, distillation, claude-opus, qlora, unsloth, en, zh, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:Roman1111111/claude-opus-4.6-10000x, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational

caiovicentino1/LTX-2.3-22B-PolarQuant-Q5


license: other (ltx-2-community-license)
tags: polarquant, video-generation, audio-video, ltx-video, bit-packed
base_model: Lightricks/LTX-2.3
pipeline_tag: image-to-video
arxiv: 2603.29078

LTX-2.3 (22B) — PolarQuant Q5 (Bit-Packed)

PQ5 quantized LTX-2.3 — joint audio-video generation, 22B params.

46 GB → 15 GB (-68%) | cos_sim 0.9986 | 1,347 layers quantized

Compression

| Component | Original | PQ5 Packed | Reduction |
|---|---|---|---|
| Transformer (1,347 layers) | 37 GB | 4.6 GB | -88% |
| VAE + Skip (4,600 layers) | 9.1 GB | 9.1 GB | BF16 kept |
| Upscalers | 1.3 GB | 1.3 GB | BF16 kept |
| Total | 46.2 GB | 15 GB | -68% |

Quick Start

# 1. Install
pip install safetensors huggingface_hub scipy

# 2. Download & setup (15 GB)
git clone https://huggingface.co/caiovicentino1/LTX-2.3-22B-PolarQuant-Q5 ./LTX-PQ5
cd LTX-PQ5 && python setup.py

# 3. Generate video
python generate_ltx.py --prompt "A cat playing piano in a jazz club"

Architecture

  • 22B parameters — largest video model we've quantized
  • 48 transformer blocks, hidden=4096, MLP=16384
  • Joint audio-video generation — synchronized audio + video
  • head_dim=64 (Hadamard-compatible)
  • 5,947 tensors total (BF16)
  • Spatial + temporal upscalers included

Files

LTX-2.3-22B-PolarQuant-Q5/
├── setup.py                          # One-command setup
├── generate_ltx.py                   # Easy generation wrapper
├── polarquant/
│   ├── codes/
│   │   ├── chunk_00_codes.safetensors (1.5 GB)
│   │   ├── chunk_01_codes.safetensors (1.5 GB)
│   │   ├── chunk_02_codes.safetensors (1.5 GB)
│   │   └── chunk_03_codes.safetensors (0.1 GB)
│   └── bf16/
│       └── ltx23_bf16.safetensors    (9.1 GB)
└── upscalers/
    ├── ltx-2.3-spatial-upscaler-x2-1.1.safetensors  (1.0 GB)
    └── ltx-2.3-temporal-upscaler-x2-1.0.safetensors (0.3 GB)

Hardware

| GPU | VRAM | Status |
|-----|------|--------|
| A100 (40 GB) | 40 GB | Recommended |
| A100 (80 GB) | 80 GB | Full speed |
| RTX 4090 (24 GB) | 24 GB | With offloading |

Citation

@article{polarquant2026,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

46 GB → 15 GB with cos_sim 0.9986. Quantized with PolarQuant.

Author: caiovicentino1

Likes: 11

Downloads: 0

Tags: polarquant, video-generation, audio-video, ltx-video, bit-packed, image-to-video, arxiv:2603.29078, arxiv:2601.03233, base_model:Lightricks/LTX-2.3, base_model:finetune:Lightricks/LTX-2.3, license:other, region:us

VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M


tags: doom, game-ai, ascii, ModernBERT, hash-embeddings, depth-aware, attention-pooling, classifier, real-time, edge-deployment, tiny-model
pipeline_tag: text-classification
library_name: transformers
license: apache-2.0
datasets:

  • VAGOsolutions/SauerkrautLM-Doom-MultiVec-31k

<img src="Logo.png" width="500" height="auto">

SauerkrautLM-Doom-MultiVec-1.3M

A tiny 1.3M parameter model that plays DOOM, outperforming LLMs up to 92,000x its size.

<video controls width="100%" style="border-radius: 8px; margin: 16px 0;"> <source src="https://vago-solutions.ai/wp-content/uploads/2026/04/1mio-parameter-plays-DOOM.mp4" type="video/mp4"> </video>

This model is a ModernBERT-Hash encoder with depth-aware token representations and an attention pooling classification head, trained on 31K human gameplay demonstrations to select actions from ASCII game frame representations in real time.

Core Features and Innovations

  • 178 frags in 10 episodes (17.8 per episode) in VizDoom's defend_the_center scenario, more than all tested LLMs combined (13 frags total)
  • 31ms inference on CPU, enabling real-time gameplay at 35 FPS
  • Depth-aware ASCII encoding: VizDoom depth buffer encoded as learned 16-bin token embeddings fused with character embeddings
  • ModernBERT-Hash architecture: Hash embeddings + local/global attention + Flash Attention 2 support
  • Character-level tokenizer: 75 tokens, no BPE, preserving full spatial granularity of ASCII frames

David vs. Goliath: 1.3M Parameters vs. 120 Billion

With 1.3 million parameters -- less than 1/92,000th the size of Nemotron-120B -- SauerkrautLM-Doom-MultiVec achieves:

  • 178 frags vs 0 for GPT-4o-mini (proprietary)
  • 178 frags vs 3 for Nemotron-120B (120B, 92,000x larger)
  • 178 frags vs 2 for Qwen3.5-27B (27B, 20,000x larger)
  • 178 frags vs 8 for Gemini Flash Lite (proprietary)

All LLMs are vision-capable multimodal models evaluated on text (ASCII + depth), their strongest modality. Our tiny text-only model outperforms them on their home turf.


Model Overview

Model: VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M
Architecture: ModernBERT-Hash encoder + Attention Pooling + Linear Classifier
Task: Real-time DOOM action classification from ASCII frames
Training Data: 31,645 human gameplay demonstrations with depth annotations
License: Apache 2.0
Model Size: 1.3M parameters (~5MB FP32)

Model Description

  • Model Type: Multi-vector encoder with attention pooling classification head
  • Encoder: 5-layer ModernBERT with hash embeddings (H=128, 4 heads)
  • Tokenizer: Character-level, 75 tokens (no BPE)
  • Depth Bins: 16 learned depth embeddings added to token representations
  • Actions: 4 (shoot, move_forward, turn_left, turn_right)
  • Max Sequence Length: 1,200 tokens
  • Training Loss: KL-divergence on soft action scores
  • Inference Latency: 31ms (CPU), 29ms (GPU)

Architecture

DoomMultiVecClassifier(
  encoder: ModernBertHashModel(
    embeddings: HashEmbedding(75 vocab, 16 proj -> 128 dim)
    depth_embedding: DepthEmbedding(16 bins x 128 dim)
    layers: 5x TransformerLayer(H=128, heads=4, FFN=512)
  )
  attention_pool: Linear(128 -> 1)  # learned attention weights
  classifier: Linear(128 -> 4)      # action probabilities
)

ModernBERT-Hash: Advancing Hash Embeddings to Modern Architectures

This model introduces hash embeddings on the ModernBERT architecture, a combination that has not been explored before. Previous work on hash embeddings for tiny language models (NeuML's BERT-Hash, Svenstrup et al. 2017) applied the technique to the original BERT architecture from 2018. We bring hash embeddings to ModernBERT (Warner et al. 2024), which provides several architectural advantages:

  • Rotary Position Embeddings (RoPE) instead of learned absolute positions, enabling better generalization across sequence lengths
  • Alternating local + global attention: Layers alternate between sliding-window local attention (w=128) and full global attention, matching the spatial structure of ASCII frames where local patterns (adjacent characters) and global context (arena layout) both matter
  • Flash Attention 2 support for efficient GPU training with long sequences (~1,100 tokens per frame)
  • Pre-normalization with RMSNorm for more stable training of tiny models

The hash embedding layer replaces the standard embedding table (V x H) with a two-stage projection: a compact lookup (V x P) followed by a linear projection (P x H), where P=16 is the projection dimension. For our 75-token vocabulary this reduces embedding parameters from 9,600 to 4,480 (53% reduction). While modest for a tiny vocabulary, the same architecture scales to standard vocabularies: at 30K tokens, hash embeddings reduce embedding parameters by 97% (from 3.8M to 120K), which is the key enabler for sub-1M parameter language models.

Combined with depth embeddings (16 learned bins added to token representations), the model receives both spatial (what the character looks like) and distance (how far away it is) information at the token level, a novel input representation for game state encoding.
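
A minimal PyTorch sketch of the two-stage hash embedding plus additive depth embedding described above, using the dimensions from this card. It is a simplified illustration, not the released implementation; the card's 4,480-parameter embedding count suggests additional components (e.g., multiple hash functions) beyond this sketch.

import torch
import torch.nn as nn

class HashDepthEmbedding(nn.Module):
    """Two-stage embedding: a compact V x P lookup projected to H,
    plus a learned depth-bin embedding added per token."""
    def __init__(self, vocab=75, proj=16, hidden=128, depth_bins=16):
        super().__init__()
        self.lookup = nn.Embedding(vocab, proj)             # V x P = 75 x 16
        self.project = nn.Linear(proj, hidden, bias=False)  # P x H = 16 x 128
        self.depth = nn.Embedding(depth_bins, hidden)       # 16 x 128

    def forward(self, token_ids, depth_ids):
        # Character identity (what it looks like) + distance (how far).
        return self.project(self.lookup(token_ids)) + self.depth(depth_ids)

emb = HashDepthEmbedding()
tokens = torch.randint(0, 75, (1, 1000))   # flattened 40x25 ASCII frame
depths = torch.randint(0, 16, (1, 1000))   # quantized depth per character
print(emb(tokens, depths).shape)           # torch.Size([1, 1000, 128])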


Input Pipeline

<img src="pipeline.png" width="100%">

From VizDoom game frame to model input: (a) RGB frame, (b) grayscale, (c) depth buffer, (d) ASCII brightness map (40x25), (e) depth bins with 16 quantization levels (red=near, green=far), (f) combined ASCII + depth representation as fed to the model.


Benchmark: DOOM defend_the_center

All agents receive identical input: ASCII frame (40x25) + depth map. Game settings match training conditions (640x480, HUD on, real-time pacing). Frags are tracked via VizDoom's per-step reward signal.

| Agent | Params | Episodes | Avg Survival | Max Survival | Total Frags | Latency |
|-------|--------|----------|-------------|-------------|-------------|---------|
| SauerkrautLM-Doom-MultiVec-1.3M | 1.3M | 10 | 388 | 525 | 178 | 31ms |
| GPT-4o-mini | proprietary | 10 | 104 | 228 | 0 | 646ms |
| Nemotron-120B | 120B | 5 | 88 | 104 | 3 | 8.9s |
| Qwen3.5-27B | 27B | 3 | 87 | 109 | 2 | 13.3s |
| Gemini Flash Lite | proprietary | 10 | 81 | 97 | 8 | 920ms |

178 frags vs 13 for all LLMs combined. GPT-4o-mini scores zero frags across 10 episodes (pure evasion). Our model actively engages enemies, turns to face them, and fires -- playing DOOM as intended.


Quick Start

import torch
from transformers import AutoTokenizer

# Load model
model_path = "VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

from doom_multivec.model.classifier import DoomMultiVecClassifier
state = torch.load(f"{model_path}/model.pt", map_location="cpu")
model = DoomMultiVecClassifier(model_path, pool_mode="attention", num_actions=4)
model.load_state_dict(state)
model.eval()

# Classify an ASCII frame
ascii_frame = "." * 1024  # placeholder for a 40x25 ASCII frame (40*25 chars + newlines ≈ 1,024)
encoded = tokenizer(ascii_frame, return_tensors="pt", max_length=1100,
                    padding="max_length", truncation=True)

with torch.no_grad():
    result = model(encoded["input_ids"], encoded["attention_mask"])
    probs = torch.softmax(result["logits"], dim=-1)[0]

actions = ["shoot", "move_forward", "turn_left", "turn_right"]
for action, prob in zip(actions, probs):
    print(f"  {action}: {prob:.3f}")

Watch the Model Play DOOM

pip install vizdoom
python scripts/play_doom_visual.py --model models/doom-multivec-trained --scenario defend_the_center

Run the LLM Benchmark

pip install openai
export OPENAI_API_KEY="your-key"
python scripts/benchmark.py --agent multivec --episodes 10 --realtime
python scripts/benchmark.py --agent gpt4mini --episodes 10 --realtime

Parameter Budget

| Component | Parameters | % of Total |
|-----------|-----------|------------|
| Hash Embeddings (75 vocab, 16 proj) | 4,480 | 0.3% |
| Depth Embeddings (17 bins x 128) | 2,176 | 0.2% |
| Transformer Layers (x 5) | 1,312,000 | 99.1% |
| Attention Pool + Classifier | 644 | 0.05% |
| Total | 1,319,300 | 100% |


Paper

Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control

David Golchinfar (VAGO Solutions, Germany), Daryoush Vaziri (University of Applied Sciences Bonn-Rhein-Sieg, Germany), Alexander Marquardt (CARE Laboratory, NAIST, Japan)

Available in the project repository under paper/doom_multivec.pdf.


Acknowledgements

This work was developed using VizDoom as the game platform, PyLate for initial multi-vector experiments, and the HuggingFace ecosystem for model development. The ModernBERT-Hash architecture builds on NeuML's BERT-Hash models. Training data includes human gameplay demonstrations and frames from the arnaudstiegler GameNGen reproduction datasets on HuggingFace.

DOOM is a registered trademark of id Software LLC. This project is not affiliated with or endorsed by id Software.


Citation

@misc{SauerkrautLM-Doom-MultiVec,
  title={SauerkrautLM-Doom-MultiVec-1.3M: Playing DOOM with 1.3M Parameters},
  author={David Golchinfar and Daryoush Vaziri and Alexander Marquardt},
  url={https://huggingface.co/VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M},
  year={2026}
}

Author: VAGOsolutions

Likes: 8

Downloads: 0

Tags: transformers, safetensors, modernbert, feature-extraction, doom, game-ai, ascii, ModernBERT, hash-embeddings, depth-aware, attention-pooling, classifier, real-time, edge-deployment, tiny-model, text-classification, custom_code, dataset:VAGOsolutions/SauerkrautLM-Doom-MultiVec-31k, license:apache-2.0, text-embeddings-inference, endpoints_compatible, region:us

douyamv/Gemma-4-31B-JANG_4M-CRACK-GGUF


language: en
license: gemma
tags: gemma4, gguf, quantized, 31b
base_model: google/gemma-4-31b-it
pipeline_tag: text-generation

Gemma-4-31B-JANG_4M-CRACK-GGUF

GGUF quantizations of Gemma-4-31B-JANG_4M-CRACK for use with llama.cpp, LM Studio, Ollama, and other GGUF-compatible inference engines.

About the Model

  • Base model: google/gemma-4-31b-it
  • Architecture: Gemma 4 Dense Transformer (31B parameters, 60 layers)
  • Features: Hybrid Sliding/Global Attention, Vision + Audio multimodal
  • Modification: CRACK abliteration (refusal removal) + JANG v2 mixed-precision quantization

Why This Conversion?

The original model uses JANG v2 mixed-precision MLX quantization (attention 8-bit + MLP 4-bit), which is only compatible with vMLX. Standard tools (llama.cpp, LM Studio, oMLX, mlx-lm) cannot load this format due to mixed per-layer bit widths.

This repository provides standard GGUF quantizations that work everywhere.

Conversion Process

Original (JANG v2 MLX safetensors, ~18GB)
    ↓ dequantize (attention 8-bit → f16, MLP 4-bit → f16)
Intermediate (float16 safetensors, ~60GB)
    ↓ convert_hf_to_gguf.py + quantize
GGUF (various quantizations)

Note: Since the original was already quantized (avg 5.1 bits), the dequantized f16 intermediate is an approximation. Re-quantizing to GGUF introduces minimal additional quality loss since the attention layers were preserved at 8-bit in the original.

Available Quantizations

| File | Quant | Size | Quality | Notes |
|------|-------|------|---------|-------|
| gemma-4-31b-jang-crack-Q3_K_M.gguf | Q3_K_M | ~14 GB | Acceptable | Minimum viable quality |
| gemma-4-31b-jang-crack-Q4_K_M.gguf | Q4_K_M | ~18 GB | Good | Best size/quality balance |
| gemma-4-31b-jang-crack-Q5_K_M.gguf | Q5_K_M | ~21 GB | Better | Recommended if RAM allows |
| gemma-4-31b-jang-crack-Q6_K.gguf | Q6_K | ~25 GB | Very Good | High quality |
| gemma-4-31b-jang-crack-Q8_0.gguf | Q8_0 | ~33 GB | Near lossless | Closest to original |

System Requirements

| Quantization | Minimum RAM | Recommended |
|-------------|------------|-------------|
| Q3_K_M | 20 GB | 24 GB |
| Q4_K_M | 24 GB | 32 GB |
| Q5_K_M | 28 GB | 36 GB |
| Q6_K | 32 GB | 40 GB |
| Q8_0 | 40 GB | 48 GB |

Usage

LM Studio

Download any .gguf file and open it in LM Studio.

llama.cpp

./llama-cli -m gemma-4-31b-jang-crack-Q4_K_M.gguf -p "Hello" -n 256

Ollama

echo 'FROM ./gemma-4-31b-jang-crack-Q4_K_M.gguf' > Modelfile
ollama create gemma4-crack -f Modelfile
ollama run gemma4-crack

License

Gemma License

Disclaimer

This model has had safety guardrails removed. Use responsibly and in compliance with applicable laws.

Author: douyamv

Likes: 7

Downloads: 0

Tags: gemma4, gguf, quantized, 31b, text-generation, en, license:gemma, region:us

huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated


library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
base_model: Jackrong/Qwopus3.5-27B-v3
tags: abliterated, uncensored, Qwopus, reasoning, chain-of-thought, Dense

huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated

This is an uncensored version of Jackrong/Qwopus3.5-27B-v3 created with abliteration (see remove-refusals-with-transformers for details). This is a crude, proof-of-concept implementation for removing refusals from an LLM without using TransformerLens.

Ollama

Please use the latest version of Ollama.

You can use huihui_ai/qwen3.5-abliterated:27B-Qwopus directly:

ollama run huihui_ai/qwen3.5-abliterated:27B-Qwopus

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue development and improvement; even the price of a cup of coffee makes a difference.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 6

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, abliterated, uncensored, Qwopus, reasoning, chain-of-thought, Dense, conversational, base_model:Jackrong/Qwopus3.5-27B-v3, base_model:finetune:Jackrong/Qwopus3.5-27B-v3, license:apache-2.0, endpoints_compatible, region:us

caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5


license: apache-2.0
tags: polarquant, qwen3.5, abliterated, uncensored, qwopus
base_model: huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated
arxiv: 2603.29078

Qwopus3.5-27B-v3 Abliterated — PolarQuant Q5

PQ5+INT4 weights + Q3 KV cache — uncensored reasoning model on RTX 4090.

Quick Start

# pip install polarquant
import polarengine_vllm
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    device_map="auto"
)


Results

| Metric | FP16 KV | Q3 KV | Q2 KV |
|---|---|---|---|
| Speed | 21.8 tok/s | 22.1 tok/s | 22.0 tok/s |
| Peak VRAM | 70.4 GB | 19.1 GB | 19.1 GB |

Fits RTX 4090 (24 GB) with Q3 KV cache!

| Component | Value |
|---|---|
| Model VRAM | 19.1 GB |
| Polar Codes | 16.4 GB (bit-packed) |
| cos_sim | >0.998 |

Architecture

  • Qwen3.5-27B hybrid (attention + linear attention)
  • 64 layers, head_dim=128
  • Abliterated (uncensored) by huihui-ai
  • Base: Jackrong/Qwopus3.5-27B-v3 (Claude Opus reasoning distilled)
  • Apache 2.0

Author: caiovicentino1

Likes: 4

Downloads: 0

Tags: safetensors, qwen3_5, polarquant, qwen3.5, abliterated, uncensored, qwopus, arxiv:2603.29078, base_model:huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, base_model:quantized:huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, license:apache-2.0, compressed-tensors, region:us

mradermacher/Huihui-Qwopus3.5-27B-v3-abliterated-GGUF


base_model: huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated
language: en
library_name: transformers
license: apache-2.0
quantized_by: mradermacher
tags: abliterated, uncensored, Qwopus, reasoning, chain-of-thought, Dense

About


static quants of https://huggingface.co/huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | mmproj-Q8_0 | 0.7 | multi-modal supplement |
| GGUF | mmproj-f16 | 1.0 | multi-modal supplement |
| GGUF | Q2_K | 10.8 | |
| GGUF | Q3_K_S | 12.2 | |
| GGUF | Q3_K_M | 13.4 | lower quality |
| GGUF | Q3_K_L | 14.4 | |
| GGUF | IQ4_XS | 15.3 | |
| GGUF | Q4_K_S | 15.7 | fast, recommended |
| GGUF | Q4_K_M | 16.6 | fast, recommended |
| GGUF | Q5_K_S | 18.8 | |
| GGUF | Q5_K_M | 19.3 | |
| GGUF | Q6_K | 22.2 | very good quality |
| GGUF | Q8_0 | 28.7 | fast, best quality |

ikawrakow provides a handy graph comparing some lower-quality quant types (lower is better).

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 3

Downloads: 0

Tags: transformers, gguf, abliterated, uncensored, Qwopus, reasoning, chain-of-thought, Dense, en, base_model:huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, base_model:quantized:huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, license:apache-2.0, endpoints_compatible, region:us, conversational

barozp/gemma-4-21b-a4b-it-REAP-GGUF


base_model: 0xSero/gemma-4-21b-a4b-it-REAP
language: en
pipeline_tag: text-generation
license: gemma
arxiv: 2510.13999
tags: gguf, gemma4, image-text-to-text, moe, pruning, reap, cerebras, expert-pruning, conversational, llama-cpp, quantized

gemma-4-21b-a4b-it-REAP — GGUF Quantizations

GGUF quantizations of 0xSero/gemma-4-21b-a4b-it-REAP, a 20% expert-pruned variant of google/gemma-4-26b-a4b-it using the REAP (Router-weighted Expert Activation Pruning) method.

Available Files

| File | Quant | Size | BPW | Description |
| --- | --- | --- | --- | --- |
| gemma-4-21b-a4b-it-REAP-BF16.gguf | BF16 | ~43 GB | 16.0 | Full precision, for re-quantization |
| gemma-4-21b-a4b-it-REAP-Q8_0.gguf | Q8_0 | ~22 GB | 8.0 | Near-lossless, large file |
| gemma-4-21b-a4b-it-REAP-Q6_K.gguf | Q6_K | ~17 GB | 6.56 | Near-lossless, recommended for high quality |
| gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf | Q5_K_M | ~16 GB | 5.68 | High quality, larger size |
| gemma-4-21b-a4b-it-REAP-Q5_K_S.gguf | Q5_K_S | ~15 GB | 5.52 | High quality, slightly smaller |
| gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf | Q4_K_M | ~14 GB | 4.89 | Recommended — best quality/size balance |
| gemma-4-21b-a4b-it-REAP-Q4_K_S.gguf | Q4_K_S | ~13 GB | 4.63 | 4-bit small |
| gemma-4-21b-a4b-it-REAP-Q3_K_L.gguf | Q3_K_L | ~11 GB | 4.27 | 3-bit large |
| gemma-4-21b-a4b-it-REAP-Q3_K_M.gguf | Q3_K_M | ~11 GB | 3.91 | 3-bit medium |
| gemma-4-21b-a4b-it-REAP-Q3_K_S.gguf | Q3_K_S | ~10 GB | 3.66 | 3-bit small |
| gemma-4-21b-a4b-it-REAP-Q2_K.gguf | Q2_K | ~9 GB | 2.96 | Smallest size, lowest quality |

Model Details

| Property | Value |
| --- | --- |
| Architecture | Gemma 4 (hybrid sliding/full attention MoE) |
| Parameters | 21.34B total / ~4B active per token |
| Experts | 103 total / 8 active per token |
| Context Length | 262,144 tokens |
| Original dtype | BF16 |
| Quantization tool | llama.cpp |
| License | Gemma |

Quantization Process

# 1. Convert BF16 SafeTensors → GGUF
python convert_hf_to_gguf.py 0xSero/gemma-4-21b-a4b-it-REAP \
  --outfile gemma-4-21b-a4b-it-REAP-BF16.gguf \
  --outtype bf16

# 2. Quantize (example: Q4_K_M)
llama-quantize gemma-4-21b-a4b-it-REAP-BF16.gguf \
  gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf Q4_K_M

Usage

llama.cpp

llama-cli \
  -m gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  -p "Your prompt here"

llama-server (OpenAI-compatible API)

llama-server \
  -m gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  --port 8080

LM Studio / Jan / Ollama

Download the .gguf file and load it directly in your preferred local inference UI.

Hardware Requirements

| Config | VRAM / RAM |
| --- | --- |
| Full GPU (Q4_K_M, recommended) | 16+ GB VRAM |
| Hybrid CPU+GPU (Q4_K_M) | 8 GB VRAM + 12 GB RAM |
| CPU only (Q4_K_M) | 18+ GB RAM |

About the Original Model

0xSero/gemma-4-21b-a4b-it-REAP applies REAP expert pruning (arXiv:2510.13999) to remove 20% of MoE experts (25 of 128 per layer) from Gemma 4 26B-A4B-it, while preserving routing behavior. Active parameters per token remain unchanged at ~4B. The result is an ~18% smaller model with near-identical generation quality across coding, math, and reasoning benchmarks.
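
A conceptual sketch of router-weighted expert saliency in the spirit of REAP (a simplified criterion for illustration; see arXiv:2510.13999 for the actual method): rank each expert by how often and how strongly the router selects it, weighted by the magnitude of its contribution, then drop the lowest-scoring fraction.

import torch

def expert_saliency(router_probs, expert_out_norms):
    """router_probs: (tokens, n_experts) softmaxed routing weights.
    expert_out_norms: (tokens, n_experts) norms of each expert's output."""
    return (router_probs * expert_out_norms).mean(dim=0)

def experts_to_keep(router_probs, expert_out_norms, prune_fraction=0.2):
    saliency = expert_saliency(router_probs, expert_out_norms)
    n_keep = saliency.numel() - int(saliency.numel() * prune_fraction)
    return torch.topk(saliency, n_keep).indices.sort().values

# Toy example: 128 experts, 1,000 routed tokens; 20% pruning keeps 103.
probs = torch.softmax(torch.randn(1000, 128), dim=-1)
norms = torch.rand(1000, 128)
print(experts_to_keep(probs, norms).numel())  # 103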

License

Gemma — see Google's Gemma Terms of Use.

Author: barozp

Likes: 3

Downloads: 0

Tags: gguf, gemma4, image-text-to-text, moe, pruning, reap, cerebras, expert-pruning, conversational, llama-cpp, quantized, text-generation, en, arxiv:2510.13999, base_model:0xSero/gemma-4-21b-a4b-it-REAP, base_model:quantized:0xSero/gemma-4-21b-a4b-it-REAP, license:gemma, endpoints_compatible, region:us

hadadxyz/Qwen3-4B-Diversity


base_model: Qwen/Qwen3-4B
tags: distillation, distilled, sft, peft, qwen3
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/hadadxyz/Qwen3-4B-Diversity/blob/main/LICENSE

datasets:

  • ianncity/KIMI-K2.5-550000x
  • Jackrong/Qwen3.5-reasoning-700x
  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • TeichAI/claude-4.5-opus-high-reasoning-250x
  • TeichAI/gemini-3-pro-preview-high-reasoning-250x
  • TeichAI/claude-haiku-4.5-high-reasoning-1700x
  • TeichAI/gpt-5.2-high-reasoning-250x
  • Roman1111111/gemini-3.1-pro-hard-high-reasoning
  • Jackrong/glm-4.7-multiturn-CoT
  • bmeyer2025/glm5-reasoning-traces
  • TeichAI/claude-sonnet-4.5-high-reasoning-250x
  • TeichAI/deepseek-v3.2-speciale-openr1-math-3k
  • TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k
  • TeichAI/deepseek-v3.2-speciale-1000x
  • TeichAI/gpt-5-codex-1000x

Reported MMLU results (cais/mmlu, accuracy %):

| MMLU Subset | Accuracy |
|---|---|
| MMLU (overall) | 67.8 |
| Humanities | 57.9 |
| Formal Logic | 58.7 |
| High School European History | 78.2 |
| High School US History | 84.8 |
| High School World History | 83.1 |
| International Law | 77.7 |
| Jurisprudence | 78.7 |
| Logical Fallacies | 82.8 |
| Moral Disputes | 71.1 |
| Moral Scenarios | 28.4 |
| Philosophy | 73.3 |
| Prehistory | 76.2 |
| Professional Law | 47.4 |
| World Religions | 78.4 |
| Other | 72.1 |
| Business Ethics | 73.0 |
| Clinical Knowledge | 75.5 |
| College Medicine | 71.1 |
| Global Facts | 41.0 |
| Human Aging | 67.7 |
| Management | 84.5 |
| Marketing | 85.5 |
| Medical Genetics | 75.0 |
| Miscellaneous | 79.7 |
| Nutrition | 74.8 |
| Professional Accounting | 55.0 |
| Professional Medicine | 71.7 |
| Virology | 53.0 |
| Social Sciences | 78.4 |
| Econometrics | 64.0 |
| High School Geography | 84.3 |
| High School Government and Politics | 87.0 |
| High School Macroeconomics | 74.6 |
| High School Microeconomics | 80.7 |
| High School Psychology | 87.2 |
| Human Sexuality | 75.6 |
| Professional Psychology | 71.2 |
| Public Relations | 71.8 |
| Security Studies | 74.3 |
| Sociology | 84.1 |
| US Foreign Policy | 81.0 |
| STEM | 68.1 |
| Abstract Algebra | 45.0 |
| Anatomy | 61.5 |
| Astronomy | 78.9 |
| College Biology | 83.3 |
| College Chemistry | 54.0 |
| College Computer Science | 69.0 |
| College Mathematics | 58.0 |
| College Physics | 53.9 |
| Computer Security | 80.0 |
| Conceptual Physics | 77.0 |
| Electrical Engineering | 76.6 |
| Elementary Mathematics | 65.6 |
| High School Biology | 86.1 |
| High School Chemistry | 70.4 |
| High School Computer Science | 86.0 |
| High School Mathematics | 42.6 |
| High School Physics | 62.9 |
| High School Statistics | 71.3 |
| Machine Learning | 57.1 |

Introduction


Qwen3-4B-Diversity is a fine-tuned language model based on Qwen/Qwen3-4B that has been trained on a diverse collection of high-quality reasoning datasets. This model combines knowledge distilled from various state-of-the-art AI systems to provide enhanced reasoning capabilities across multiple domains including mathematics, coding, general problem-solving, and multi-turn conversations.
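The per-subject MMLU accuracies in the metadata above can be spot-checked with EleutherAI's lm-evaluation-harness. Below is a minimal sketch, assuming lm-eval v0.4+ and its mmlu_<subject> task-naming convention; neither is specified by the model card:

# pip install lm-eval
import lm_eval

# Score two of the MMLU subtasks reported above; the task names
# follow the harness's mmlu_<subject> convention (an assumption,
# not something stated in the model card).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=hadadxyz/Qwen3-4B-Diversity,dtype=auto",
    tasks=["mmlu_human_aging", "mmlu_management"],
)
print(results["results"])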

Training Configuration

The model was trained using supervised fine-tuning techniques with parameter-efficient methods to optimize performance while maintaining computational efficiency. Key training parameters include:

| Parameter | Value |
|---|---|
| Number of Epochs | 2 |
| Context Length | 40,960 |

Hardware and Resources

| Resource | Specification |
|---|---|
| GPU | A100-80GB |
| Training Duration | Approximately 17 hours |
| Estimated Cost | $27 to $30 |

Training Data

| Dataset | Rows Used | Model |
|---|---|---|
| ianncity/KIMI-K2.5-550000x (General-Distillation) | 1,000 | Kimi K2.5 |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen3.5 |
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 | Claude Opus 4.6 |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | Claude Opus 4.5 |
| TeichAI/gemini-3-pro-preview-high-reasoning-250x | 248 | Gemini 3 Pro |
| TeichAI/claude-haiku-4.5-high-reasoning-1700x | 1,688 | Claude Haiku 4.5 |
| TeichAI/gpt-5.2-high-reasoning-250x | 249 | GPT-5.2 |
| Roman1111111/gemini-3.1-pro-hard-high-reasoning | 3,150 | Gemini 3.1 Pro |
| Jackrong/glm-4.7-multiturn-CoT | 5,090 | GLM-4.7 |
| bmeyer2025/glm5-reasoning-traces | 1,744 | GLM-5 |
| TeichAI/claude-sonnet-4.5-high-reasoning-250x | 247 | Claude Sonnet 4.5 |
| TeichAI/deepseek-v3.2-speciale-openr1-math-3k | 3,317 | DeepSeek V3.2-Speciale |
| TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k | 2,953 | DeepSeek V3.2-Speciale |
| TeichAI/deepseek-v3.2-speciale-1000x | 991 | DeepSeek V3.2-Speciale |
| TeichAI/gpt-5-codex-1000x | 991 | GPT-5 Codex |
| Total | 24,877 | Combined diverse reasoning dataset |

Model Capabilities

This model excels in several key areas:

  1. Advanced Reasoning: The model can break down complex problems into steps and provide detailed reasoning processes.

  2. Mathematical Problem Solving: Enhanced capabilities for mathematical reasoning and problem-solving through dedicated math-focused datasets.

  3. Code Generation and Understanding: Improved coding abilities from multiple code-reasoning datasets including DeepSeek and GPT-5 Codex data.

  4. Multi-Turn Conversations: Better handling of extended dialogues and context-aware responses.

  5. Domain Versatility: Exposure to reasoning patterns from various AI systems provides flexibility across different domains and task types.

Usage

Quick Demo

For a quick demo at no cost, you can use Google Colab.

Ollama (Local)

# https://ollama.com/hadad/qwen3-4bd

# hadad/qwen3-4bd:Q8_0  |  4.3GB
# hadad/qwen3-4bd:BF16  |  8.1GB

# ollama pull hadad/qwen3-4bd:Q8_0

ollama run hadad/qwen3-4bd:Q8_0

If you are using Ollama and need tools or function calling, it is recommended to use Ollama's OpenAI-compatible API, which exposes tool calling through the standard chat-completions interface, as sketched below.

Refer to the Ollama documentation.
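A minimal sketch of that route with the openai Python client pointed at a local Ollama server; the default port and the get_weather tool schema are illustrative assumptions:

from openai import OpenAI

# Ollama serves an OpenAI-compatible endpoint under /v1; the api_key
# value is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A purely illustrative tool definition.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="hadad/qwen3-4bd:Q8_0",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)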

Python (Local)

# pip install transformers==4.56.2
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "hadadxyz/Qwen3-4B-Diversity"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # find the position just after the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Inference Parameters

For optimal results, we recommend the following generation parameters:

Thinking

| Parameter | Recommended Value | Description |
|---|---|---|
| temperature | 0.6 | Controls randomness in generation |
| top_p | 0.95 | Nucleus sampling threshold |
| top_k | 20 | Top-k sampling parameter |
| min_p | 0 | Minimum probability threshold |

Non-Thinking

| Parameter | Recommended Value | Description |
|---|---|---|
| temperature | 0.7 | Controls randomness in generation |
| top_p | 0.8 | Nucleus sampling threshold |
| top_k | 20 | Top-k sampling parameter |
| min_p | 0 | Minimum probability threshold |
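These settings map directly onto generate() sampling arguments in the transformers example above; a sketch for thinking mode (swap in the non-thinking values when enable_thinking=False):

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,     # sampling must be enabled for the settings below
    temperature=0.6,    # 0.7 for non-thinking mode
    top_p=0.95,         # 0.8 for non-thinking mode
    top_k=20,
    min_p=0.0,
)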

Citation

If you use this model in your research or applications, please cite both this model and the base model:

@misc{qwen3-4b-diversity,
  author = {hadadxyz},
  title  = {Qwen3-4B-Diversity},
  year   = {2026},
  url    = {https://huggingface.co/hadadxyz/Qwen3-4B-Diversity}
}

Acknowledgments

This model was made possible through the combination of multiple high-quality datasets from the community. We acknowledge and thank all dataset creators and the Qwen team for providing the excellent base model.

Author: hadadxyz

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, distillation, distilled, sft, peft, conversational, dataset:ianncity/KIMI-K2.5-550000x, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:TeichAI/claude-4.5-opus-high-reasoning-250x, dataset:TeichAI/gemini-3-pro-preview-high-reasoning-250x, dataset:TeichAI/claude-haiku-4.5-high-reasoning-1700x, dataset:TeichAI/gpt-5.2-high-reasoning-250x, dataset:Roman1111111/gemini-3.1-pro-hard-high-reasoning, dataset:Jackrong/glm-4.7-multiturn-CoT, dataset:bmeyer2025/glm5-reasoning-traces, dataset:TeichAI/claude-sonnet-4.5-high-reasoning-250x, dataset:TeichAI/deepseek-v3.2-speciale-openr1-math-3k, dataset:TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k, dataset:TeichAI/deepseek-v3.2-speciale-1000x, dataset:TeichAI/gpt-5-codex-1000x, base_model:Qwen/Qwen3-4B, base_model:finetune:Qwen/Qwen3-4B, license:apache-2.0, model-index, text-generation-inference, endpoints_compatible, region:us

samuelcardillo/Qwopus-MoE-35B-A3B


language: en, zh

license: apache-2.0

tags: qwen3.5, moe, reasoning, distillation, claude-opus, qlora, unsloth

base_model: Qwen/Qwen3.5-35B-A3B

datasets: nohurry/Opus-4.6-Reasoning-3000x-filtered, Jackrong/Qwen3.5-reasoning-700x, Roman1111111/claude-opus-4.6-10000x

Qwopus MoE 35B-A3B — Claude Opus 4.6 Reasoning Distilled

QLoRA fine-tune of Qwen3.5-35B-A3B (MoE, 3B active parameters) with Claude Opus 4.6 reasoning distillation. Training recipe adapted from Jackrong's Qwopus3.5-27B-v3.

This is the full BF16 safetensors model. For GGUF quantizations (Q4, Q5, Q6, Q8), see samuelcardillo/Qwopus-MoE-35B-A3B-GGUF.

Credits

This model is based on the work of Jackrong and his Qwopus3.5-27B-v3 training methodology — same datasets, same philosophy, adapted for the MoE architecture. See his complete training guide.

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Precision | BF16 |

Training Details

| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit base + LoRA in BF16) |
| Framework | Unsloth 2026.4.2 + TRL |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Targets | q_proj, k_proj, v_proj, o_proj |
| Trainable Parameters | 6,881,280 (0.02%) |
| Epochs | 2 |
| Final Loss | 0.5517 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
| Training Time | ~3.5 hours |
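The exact Unsloth calls are not shown in the card; below is a minimal sketch of the same adapter configuration (rank 32, alpha 32, attention projections only) using the bitsandbytes + peft stack instead:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base weights with BF16 compute, per the standard QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters matching the table: rank 32, alpha 32, q/k/v/o targets
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should report roughly 0.02% trainable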

Datasets (3,209 examples)

| Dataset | Examples |
|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 |
| Jackrong/Qwen3.5-reasoning-700x | 633 |
| Roman1111111/claude-opus-4.6-10000x | ~250 |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# load the full BF16 checkpoint; add device_map="auto" to spread
# the weights across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "samuelcardillo/Qwopus-MoE-35B-A3B",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "samuelcardillo/Qwopus-MoE-35B-A3B",
    trust_remote_code=True,
)
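A short generation sketch continuing from the load above, reusing the chat-template pattern from the Qwen3-4B-Diversity example earlier; the prompt and token budget are illustrative:

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))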

Acknowledgements

  • Jackrong — Training methodology and the Qwopus concept
  • Unsloth — QLoRA training framework
  • Qwen — Base model

Author: samuelcardillo

Likes: 3

Downloads: 0

Tags: safetensors, qwen3_5_moe, qwen3.5, moe, reasoning, distillation, claude-opus, qlora, unsloth, en, zh, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:Roman1111111/claude-opus-4.6-10000x, base_model:Qwen/Qwen3.5-35B-A3B, base_model:finetune:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, region:us