Today's AI Summary

AI Developments: New Models for Audio, Image Editing, and Embeddings Emerge

Here's a look at the latest developments in AI, covering new models and research papers:

Research Papers

Recent research papers introduce advancements across various AI domains:

  • ChronoGraph: Introduces a new real-world graph-based multivariate time series dataset for forecasting in microservice systems. It includes anomaly labels for evaluating anomaly detection methods.
  • Delta Activations: Proposes a method to represent fine-tuned language models as vector embeddings, enabling effective clustering by domain and task.
  • DEXOP: Presents a device for robotic data collection that sensorizes and records human manipulation, maximizing the transferability of the data to real robots.
  • ArcMemo: Introduces a method for abstract reasoning composition with lifelong LLM memory, improving performance on the ARC-AGI benchmark.
  • Unified View of LLM Post-Training: Derives a Unified Policy Gradient Estimator, presenting a wide spectrum of post-training approaches as gradients of a common objective under different data-distribution assumptions and bias-variance tradeoffs.
  • AI Bias in Resume Screening: Examines how biased AI models can influence human decision-making in resume screening, highlighting the need for careful design and oversight of AI hiring systems.
  • IPA: Introduces a feature-aware projection framework for efficient foundation model adaptation, improving performance over LoRA and DoRA on language and vision benchmarks.
  • SSGaussian: Proposes a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models.
  • SST-iTransformer: Proposes SST-iTransformer, a novel approach to parking availability prediction that fuses multi-source data.
  • PARCO: Introduces a phoneme-augmented robust contextual ASR system that improves recognition of domain-specific named entities.

Models

Several new AI models have been released:

  • NandemoGHS/Anime-XCodec2: This model is a fine-tuned variant of HKUSTAudio/xcodec2, trained on approximately 25,000 hours of Japanese anime/game-style voices. It focuses on improving the naturalness of speech reconstruction for this specific domain.
  • HancomInSpaceAI/HiEmbed_base_onnx_v1: A lightweight embedding model for vector search, designed for both Korean and English text retrieval. It demonstrates strong performance on Korean and English benchmarks, outperforming other models in overall leaderboard scores.
  • DavidAU/Qwen3-Horror-Instruct-Uncensored-262K-ctx-4B: A fine-tuned Qwen3 model for generating horror content. It is designed to increase depth and detail in prose generation.

Key Takeaways

  • Specialized Audio Models: Models like Anime-XCodec2 demonstrate the potential for fine-tuning existing architectures for specific audio styles and languages.
  • Embedding Models: HiEmbed_base_onnx_v1 showcases advancements in multilingual embedding models, achieving high performance in both Korean and English.
  • Ethical Considerations: The paper on AI bias in resume screening underscores the importance of addressing bias in AI systems to ensure fair and equitable outcomes.
  • Efficient Fine-Tuning: The IPA paper introduces a novel approach to parameter-efficient fine-tuning, improving performance while reducing the number of trainable parameters.

AI Papers for 2026-04-03

HippoCamp: Benchmarking Contextual Agents on Personal Computers

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and also limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatio-temporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatio-temporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.
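Under heavy simplification, the stage (iii) deployment logic can be sketched with placeholder linear maps standing in for the frozen SHRED encoder/decoder and the learned temporal model. All shapes, names, and the linear dynamics below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_latent, n_state, T_obs, T_total = 3, 8, 64, 10, 50

# Stand-ins for the pre-trained components (stages i and ii); in the paper
# these are a SHRED encoder/decoder and a learned temporal sequence model.
W_enc = rng.normal(size=(n_latent, n_sensors))   # frozen sensor -> latent map
W_dec = rng.normal(size=(n_state, n_latent))     # frozen latent -> full state
A = np.eye(n_latent) * 0.95                      # toy latent propagator

# Stage iii: a short window of sparse sensor measurements from the true system.
sensors = rng.normal(size=(T_obs, n_sensors))

# Encode the observed window, then propagate the last latent state forward
# to span the unobserved portion of the trajectory.
latents = sensors @ W_enc.T                      # (T_obs, n_latent)
z = latents[-1]
forecast_latents = []
for _ in range(T_total - T_obs):
    z = A @ z
    forecast_latents.append(z)

# Decode observed + forecast latents into the full spatio-temporal field.
full_latents = np.vstack([latents, forecast_latents])
trajectory = full_latents @ W_dec.T              # (T_total, n_state)
print(trajectory.shape)                          # → (50, 64)
```

The bidirectional inference the paper describes would correspond to also propagating the first latent state backward in time with an analogous model.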

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.
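The MSE-induced spectral blurring that the paper formalizes can be illustrated with a toy NumPy experiment: when the only uncertainty is the phase of a high-wavenumber wave, the MSE-optimal point forecast is the ensemble mean, which averages the phases away and damps that wavenumber's energy. This is a hand-rolled illustration, not the paper's evaluation code:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, members = 256, 20, 500          # grid size, wavenumber, ensemble size
x = np.linspace(0, 2 * np.pi, N, endpoint=False)

# Ensemble of equally likely states: the same wave with an unknown phase.
ensemble = np.array([np.sin(k * x + rng.uniform(0, 2 * np.pi))
                     for _ in range(members)])

# The MSE-optimal point forecast is the conditional mean, which averages
# the phases away and damps the wave almost entirely.
mse_optimal = ensemble.mean(axis=0)

def energy_at_k(signal, k):
    """Spectral energy at wavenumber k via the discrete Fourier transform."""
    return np.abs(np.fft.rfft(signal)[k]) ** 2

member_energy = energy_at_k(ensemble[0], k)
mean_energy = energy_at_k(mse_optimal, k)
print(mean_energy < 0.05 * member_energy)   # True: spectral energy loss at k
```

The same mechanism explains why MSE-trained forecast models systematically lose energy at high wavenumbers: fine-scale structure with uncertain placement is averaged into smoothness.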

$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27M, followed by GLM-5 at \$1.21M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .

Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.

Therefore I am. I Think

We consider the question: when a large language reasoning model makes a choice, did it think first and then decide, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7% and 79%, depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.
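The linear-probe methodology can be illustrated on synthetic "activations" in which a decision direction is linearly encoded. The data, dimensions, and least-squares probe below are invented for illustration; the paper probes real pre-generation activations of reasoning models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 400   # hidden size, number of prompts (synthetic stand-ins)

# Hypothetical setup: residual-stream activations in which a "call the tool"
# decision is encoded along a single direction, plus isotropic noise.
decision_dir = rng.normal(size=d)
decision_dir /= np.linalg.norm(decision_dir)
labels = rng.integers(0, 2, size=n)              # 1 = model will call the tool
signs = 2 * labels - 1
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(signs, decision_dir)

# Linear probe fit by least squares on a train split; the sign of the
# projection predicts the tool-calling decision on held-out examples.
w, *_ = np.linalg.lstsq(acts[:300], signs[:300].astype(float), rcond=None)
pred = np.sign(acts[300:] @ w)
acc = (pred == signs[300:]).mean()
print(acc)   # near-perfect when the decision is linearly encoded
```

Activation steering, in this picture, corresponds to adding or subtracting a multiple of `w` (or `decision_dir`) to the activations before generation.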

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset of 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework comprises four stages: seed creation, question--answer pair generation, and two verification stages (self and external). ORBIT spans 15 domains, each training pair requires 4--5 reasoning steps, and external verification is performed via search over the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, demonstrating the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper

Screening Is Enough

A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.
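A rough sketch of how screening differs from softmax attention, per the description above. The sigmoid gate and the threshold value here are guesses for illustration; the paper's exact parameterization of screening may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys = 16, 8
q = rng.normal(size=d)
K = rng.normal(size=(n_keys, d))
V = rng.normal(size=(n_keys, d))

scores = K @ q / np.sqrt(d)

# Standard softmax attention: a fixed unit mass is redistributed across all
# keys, so relevance is only relative and no key can be fully rejected.
softmax_w = np.exp(scores - scores.max())
softmax_w /= softmax_w.sum()

# Screening (hypothetical sketch): each key is compared against an absolute
# threshold tau; keys below it are discarded outright, and surviving keys
# are aggregated with independent per-key gates, so keys do not compete
# globally for a fixed mass.
tau = 0.5
keep = scores > tau
screen_w = np.where(keep, 1.0 / (1.0 + np.exp(-scores)), 0.0)  # per-key gate
out = screen_w @ V

print(keep.sum(), np.all(softmax_w > 0))
```

The key contrast: every `softmax_w` entry is strictly positive, while `screen_w` is exactly zero on rejected keys, which is what enables explicit rejection of irrelevant context.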

AI Models

Jackrong/Qwopus3.5-27B-v3-GGUF


language:
  • en
  • zh
  • ko

license: apache-2.0
base_model: unsloth/Qwen3.5-27B

tags:
  • unsloth
  • qwen
  • qwen3.5
  • reasoning
  • chain-of-thought
  • lora
  • competitive-programming

pipeline_tag: image-text-to-text

Qwopus3.5-27B-v3

Motivation and Personal Opinion

Recent advances in language agents, including systems such as OpenClaw, have predominantly focused on improving reasoning accuracy through Chain-of-Thought (CoT) and self-reflection mechanisms, encouraging models to iteratively refine their reasoning before taking actions.

However, emerging evidence suggests that such "pre-action overthinking" is not always optimal for sequential decision-making. Instead, agent performance can be more effectively improved through a trial-and-error paradigm, where actions are executed early and refined based on environmental feedback.

Supporting Evidence

  • Reflexion[^1] demonstrates that agents can significantly improve decision-making by leveraging trial, error, and self-reflection, shifting the role of reflection from pre-action deliberation to post-action correction and enabling agents to learn from concrete execution outcomes rather than speculative reasoning.

  • Post-failure reflection + retry[^2] substantially boosts performance:

    • +34.7% on mathematical reasoning tasks
    • +18.1% on function calling tasks

    This provides strong empirical evidence that reflection is most effective when grounded in execution outcomes rather than purely internal reasoning.

My Approach

For multi-step and tool-augmented agent systems, performance should not be optimized solely through deeper pre-execution reasoning. A more effective strategy is an execution-driven optimization loop, where agents perform lightweight initial reasoning, act in the environment, and iteratively refine their behavior based on feedback signals.

Paradigm Shift: from "reason-then-act" to "act-then-refine"

The objective is not to achieve optimal reasoning in a single pass, but to enable robust task completion through iterative interaction and correction.
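The act-then-refine loop described above can be sketched as a generic control loop. The environment, feedback signals, and retry budget below are invented toy stand-ins, not part of the model or any agent framework:

```python
def act_then_refine(environment, max_attempts=5):
    """Sketch of the act-then-refine paradigm: act early with lightweight
    reasoning, then refine behavior using environment feedback rather than
    extended pre-action deliberation. All names here are illustrative."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        action = environment.propose(feedback)    # lightweight initial reasoning
        ok, feedback = environment.execute(action)
        if ok:
            return attempt                        # task completed
        # post-action reflection: feedback is folded into the next proposal
    return None

class ToyEnv:
    """Toy environment: the 'task' is to find a hidden number, and feedback
    says whether the last guess was high or low."""
    def __init__(self, target=7):
        self.target, self.lo, self.hi = target, 0, 10

    def propose(self, feedback):
        if feedback == "high":
            self.hi = self.last - 1
        elif feedback == "low":
            self.lo = self.last + 1
        self.last = (self.lo + self.hi) // 2
        return self.last

    def execute(self, action):
        if action == self.target:
            return True, "done"
        return False, "high" if action > self.target else "low"

print(act_then_refine(ToyEnv()))   # → 4 (solved on the fourth attempt)
```

No single proposal is optimal in isolation; the task is completed because each execution narrows the space of plausible next actions, which is the point of the paradigm.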


Model Introduction

Qwopus3.5-27B-v3 is a reasoning-enhanced model based on Qwen3.5-27B, designed to simultaneously improve reasoning stability and correctness while optimizing inference efficiency, ultimately achieving stronger cross-task generalization capabilities, particularly in programming.

Key Highlights:

  • Structural Reasoning Optimization: Refines the fundamental structure of the reasoning process through high-quality reasoning distillation and structural alignment, enabling higher accuracy via shorter, more stable reasoning paths.
  • Tool-Calling Reinforcement: Incorporates specialized RL training for tool calling, optimized for tool-augmented agent frameworks like OpenClaw, strengthening stability in continuous task execution and proficiency in tool invocation.
  • Act-Then-Refine Paradigm: Designed for complex, multi-step agentic workflows, aligning with the core motivation of replacing pre-action deliberation with execution-driven refinement.

Chain-of-Thought Optimization

The Problem with v2 Distillation

The v2 model was primarily trained through SFT on CoT data distilled from strong teacher models such as Claude. While this can transfer high-quality reasoning patterns, CoT traces from third-party datasets do not always faithfully reflect a model's true internal reasoning process, and after analysis I found that some portions may even be "fabricated", meaning the traces were not actually generated by the claimed teacher model.[^3][^4]

Prior work further shows that CoT explanations can act as post-hoc rationalizations rather than genuine step-by-step reasoning[^3]. As a result, student models risk learning:

  • Surface-level pattern matching instead of underlying reasoning
  • Answer memorization rather than generalizable problem-solving
  • Reduced robustness on out-of-distribution tasks

What v3 Does Differently

| | v2 (Distillation) | v3 (Structural Alignment) |
|---|---|---|
| CoT Source | Third-party distilled traces | Curated, verifiable reasoning chains |
| Learning Target | Imitate teacher outputs | Learn process-level reasoning |
| Reasoning Style | Compressed, potentially fabricated | Explicit, step-by-step, faithful |
| Robustness | Lower on unseen tasks | Higher generalization |

v3 focuses on improving the faithfulness, completeness, and structural clarity of reasoning traces. Instead of imitating compressed teacher CoT, the model is trained to produce more explicit and verifiable intermediate steps, enabling a transition from "answer imitation" to process-level reasoning learning.

This improves both the interpretability and reliability of the reasoning process, providing a more stable foundation for downstream multi-step and agent-based tasks.

โš ๏ธ Side Effect: The generated CoT length in v3 will be slightly longer than v2, as a direct consequence of more explicit intermediate reasoning.


๐ŸŽ Qwopus3.5-27B-v3: Humaneval Benchmark Evaluation

Inference Setup: All models were evaluated under the Unsloth runtime using bfloat16 (BF16) precision, which balances numerical range and memory efficiency at 27B scale. Answer verification, partial CoT adjudication, and statistical analysis were cross-validated by GPT-4.5-Pro (Thinking) and Claude Opus 4.6 (Thinking) to ensure reproducibility.

HumanEval: 164-Task Full Benchmark

Three 27B-scale Qwen-family models were evaluated under a conservative manual adjudication protocol, addressing:

  • Code-extraction pollution
  • Answer/code separation issues
  • Formatting noise in otherwise correct outputs

๐Ÿ† Result: Under this fair and strict evaluation setting, Qwopus3.5-27B-v3 achieves the best strict overall score of 95.73% (157/164) โ€” outperforming Qwen3.5-27B (94.51%, 155/164) and Claude-Distilled-v2 (92.68%, 152/164), while simultaneously reducing the number of manual rescues required.

| Model | Base Pass | Plus Pass | vs. Qwen3.5-27B |
| :--- | :---: | :---: | :---: |
| Qwopus3.5-27B-v3 | 97.56% (160/164) | 95.73% (157/164) | +1.22 pp |
| Qwen3.5-27B | 95.73% (157/164) | 94.51% (155/164) | Baseline |
| Claude-Distilled-v2 | 95.12% (156/164) | 92.68% (152/164) | -1.83 pp |
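The strict ("Plus Pass") percentages above can be re-derived directly from the reported pass counts out of 164 tasks:

```python
# Reported HumanEval strict (Plus Pass) counts, recomputed as percentages.
plus_pass = {
    "Qwopus3.5-27B-v3": 157,
    "Qwen3.5-27B": 155,
    "Claude-Distilled-v2": 152,
}
scores = {model: round(100 * passed / 164, 2)
          for model, passed in plus_pass.items()}
print(scores)
# → {'Qwopus3.5-27B-v3': 95.73, 'Qwen3.5-27B': 94.51, 'Claude-Distilled-v2': 92.68}
```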



Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Qwen3.5-27B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
 │
 ▼
Qwopus3.5-27B-v3
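The "Response-Only Training" step above, which masks the loss over everything up to the assistant/think marker, can be sketched as follows. The token IDs and marker below are illustrative; real trainers such as TRL operate on tokenized tensors:

```python
# Minimal sketch of response-only loss masking: every label up to and
# including the assistant marker is set to -100 (the conventional ignore
# index for cross-entropy), so the loss only covers the model's response.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, marker_ids):
    """Copy input_ids to labels, masking everything through the end of the
    first occurrence of marker_ids (the assistant/think marker)."""
    labels = list(input_ids)
    m = len(marker_ids)
    for i in range(len(input_ids) - m + 1):
        if input_ids[i:i + m] == list(marker_ids):
            for j in range(i + m):
                labels[j] = IGNORE_INDEX
            break
    return labels

# Toy example: tokens 7, 8 stand in for "<|im_start|>assistant\n<think>".
ids = [1, 2, 3, 7, 8, 11, 12, 13]
print(mask_prompt_labels(ids, (7, 8)))
# → [-100, -100, -100, -100, -100, 11, 12, 13]
```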

Example of Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5's tendency toward excessive or repetitive reasoning on simple queries. By distilling the structured reasoning habits of top-tier models like Claude Opus, Qwopus3.5-27B-v3 adopts a highly organized, step-by-step cognitive layout.

Example: The user is asking about [Topic] and how it differs from [Topic B]. This is a [Task type] question. Let me break this down:

1. What is [Topic A]?
   - [Fact/Mechanism 1]
   - [Fact/Mechanism 2]
2. What is [Topic B]?
   - [Fact/Mechanism 1]
3. Key differences:
   - [Comparison Point 1]
   - [Comparison Point 2]

Let me make sure to be accurate: [...]
Actually, I should double-check: is [Fact] used before [Fact]? Yes, typically...
Let me provide a clear, well-structured answer:

Training Data

The model was fine-tuned on a high-fidelity reasoning dataset curated from a blend of premium open-source sources on Hugging Face. The dataset went through a rigorous mixing and cleaning process designed to filter out low-quality responses and ensure consistently strong logical performance across diverse analytical domains.

(Rest assured, the entire process is strictly by-the-book and 100% compliant with all terms and open-source licenses!)

โš ๏ธ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; factual claims made during the thinking sequence may occasionally be hallucinated, especially when they concern real-world events.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • This model is a test version intended solely for learning and demonstration purposes, and is for academic research and technical exploration use only.
  • Developer Disclaimer: This is an independent, personal project. Since the developer lacks the specialized technical resources and infrastructure of a large-scale industrial lab, the model's reasoning chain (CoT) may occasionally exhibit instability, logic loops, or reasoning drift. Users are advised to use this model with these experimental limitations in mind.

Note: The test results presented here differ from the scores on the 27B-v2 model card because the context length was increased for this evaluation. Consequently, the number of tasks affected by context window truncation has changed for each model, leading to different final scores. Please ensure comparisons are made under the same variable settings.

All post-evaluation standard result files will be uploaded to this repository for transparency and reproducibility. These include:

  • Jackrong_Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2_humaneval_all_evalonly_eval_results
  • Jackrong_Qwopus3.5-27B-v3-test1_humaneval_all_evalonly_eval_results
  • qwen_Qwen3.5-27B_humaneval_all_evalonly_eval_results

โš ๏ธ Note on evaluation artifacts.
The released result files are based on raw model generations, which may contain formatting issues (e.g., Markdown wrappers, answer/code mixing), truncation, or minor token-level corruption. As an independent project operating under limited resources, the evaluation scope here is intentionally focused rather than exhaustive โ€” a comprehensive, multi-domain assessment comparable to large institutional releases was not feasible. Capabilities beyond those benchmarked remain unverified, and users are encouraged to evaluate suitability against their own task requirements before adoption.

Acknowledgements

Significant thanks to the Unsloth AI team for making rapid fine-tuning of large LLMs accessible. We also acknowledge the Qwen team and the open-source community developers producing exceptional distilled datasets.

This Qwen3.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>

References

[^1]: Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023).
Reflexion: Language Agents with Verbal Reinforcement Learning.
arXiv:2303.11366.

[^2]: Bensal, S., Jamil, U., Bryant, C., Russak, M., Kamble, K., Mozolevskyi, D., Ali, M., & AlShikh, W. (2025).
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning.
arXiv:2505.24726. https://arxiv.org/abs/2505.24726

[^3]: Anthropic (2025). Reasoning Models Don't Always Say What They Think.
https://www.anthropic.com/research/reasoning-models-dont-say-think

[^4]: Lyu et al. (2023). Faithful Chain-of-Thought Reasoning. ACL.
https://aclanthology.org/2023.ijcnlp-main.20/

Citation

If you use this model in your research or projects, please cite:

@misc{jackrong_qwen35_27b_v3,
  title        = {Jackrong/Qwopus3.5-27B-v3},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwopus3.5-27B-v3}}
}

Author: Jackrong

Likes: 76

Downloads: 0

Tags: gguf, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, lora, competitive-programming, image-text-to-text, en, zh, ko, arxiv:2303.11366, arxiv:2505.24726, base_model:unsloth/Qwen3.5-27B, base_model:adapter:unsloth/Qwen3.5-27B, license:apache-2.0, endpoints_compatible, region:us, conversational

nvidia/Gemma-4-31B-IT-NVFP4


pipeline_tag: text-generation
base_model:
  • google/gemma-4-31B-it

license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
library_name: Model Optimizer

tags:
  • nvidia
  • ModelOpt
  • Gemma-4-31B-IT
  • lighthouse
  • quantized
  • NVFP4
Model Overview

Description:

Gemma 4 31B IT is an open multimodal model built by Google DeepMind that handles text and image inputs, can process video as sequences of frames, and generates text output. It is designed to deliver frontier-level performance for reasoning, agentic workflows, coding, and multimodal understanding on consumer GPUs and workstations, with a 256K-token context window and support for over 140 languages. The model uses a hybrid attention mechanism that interleaves local sliding-window and full global attention, with unified Keys and Values in global layers and Proportional RoPE (p-RoPE) to support long-context performance. The NVIDIA Gemma 4 31B IT NVFP4 model is quantized with NVIDIA Model Optimizer.
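The interleaved local/global attention pattern described above can be sketched as boolean masks. The window size, layer count, and local-to-global ratio below are illustrative placeholders, not Gemma 4's actual configuration:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Local causal attention: position i attends to [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    """Full causal attention: position i attends to all j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

# Hypothetical interleaving for illustration: most layers use the cheap
# local mask, with a periodic full-attention layer for long-range context.
seq_len, window, n_layers, global_every = 12, 4, 8, 4
masks = [global_mask(seq_len) if (layer + 1) % global_every == 0
         else sliding_window_mask(seq_len, window)
         for layer in range(n_layers)]
print(masks[0].sum(), masks[3].sum())  # local layers attend to far fewer positions
```

The practical payoff is that KV-cache and attention cost grow with the window size on local layers instead of the full 256K context, while the periodic global layers preserve long-range information flow.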

This model is ready for commercial/non-commercial use. <br>

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA Gemma 4 31B IT Model Card.

License and Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Apache License, Version 2.0.

Deployment Geography:

Global

Use Case:

Designed for text generation, chatbots and conversational AI, text summarization, image data extraction, reasoning, coding, multimodal understanding, function calling, and research or educational use.

Release Date:

Hugging Face [04/02/2026] via https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4

Model Architecture:

Architecture Type: Transformer <br> Network Architecture: Gemma 4 <br> Number of model parameters: 30.7B <br> Vocabulary Size: 262,144

Input:

Input Type(s): Text, Image, Video <br> Input Format(s): String, Red, Green, Blue (RGB), Video (MP4/WebM) <br> Input Parameters: One-Dimensional (1D), Two-Dimensional (2D), Three-Dimensional (3D)<br> Other Properties Related to Input: Supports variable image aspect ratios and resolutions, configurable visual token budgets of 70, 140, 280, 560, and 1120, and video inputs up to 60 seconds at one frame per second. <br> Input Context Length (ISL): 256K

Output:

Output Type(s): Text <br> Output Format: String <br> Output Parameters: 1D (One Dimensional): Sequences <br> Other Properties Related to Output: Generates text responses for chat, reasoning, coding, multimodal understanding, and function-calling workflows.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Supported Runtime Engine(s): <br>

  • vLLM <br>

Supported Hardware Microarchitecture Compatibility: <br> NVIDIA Blackwell <br>

Preferred Operating System(s): <br>

  • Linux <br>

Model Version(s):

The model version is v1.0, quantized to NVFP4 with nvidia-modelopt v0.42.0. <br>

Training, Testing, and Evaluation Datasets:

We calibrated the model using the dataset noted below, and performed evaluation using the benchmarks noted under Evaluation Datasets. We did not perform training or testing for this Model Optimizer release. The methods noted under Training and Testing Datasets below represent the data collection and labeling methods used by the third-party to train and test the underlying Gemma 4 31B IT model.<br>

Calibration Dataset:

Link: cnn_dailymail<br> Data Collection Method by dataset: Automated. <br> Labeling Method by dataset: Automated. <br> Properties: The cnn_dailymail dataset is an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail.

Training Dataset Data Modality: Text, Image, Audio, Other (Code)<br> Training Data Collection: Automated<br> Training Labeling: Undisclosed<br> Training Properties: Large-scale multimodal pre-training data spanning web documents, code, images, and audio, with a cutoff date of January 2025 and coverage in over 140 languages. Data was filtered for CSAM, sensitive data, quality, and safety.<br>

Testing Dataset Testing Data Collection: Undisclosed<br> Testing Labeling: Undisclosed<br> Testing Properties: Undisclosed<br>

Evaluation Dataset:

Data Collection Method by dataset: Hybrid: Human, Automated <br> Labeling Method by dataset: Hybrid: Human, Automated <br> Properties: We evaluated the model on benchmarks including GPQA, which is a dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. <br>

Inference:

Engine: vLLM <br> Test Hardware: NVIDIA Hopper H100 <br>

Post Training Quantization

This model was obtained by quantizing the weights and activations of Gemma 4 31B IT to the NVFP4 data type, making it ready for inference with vLLM.

Usage

To serve this checkpoint with vLLM, run the sample command below:

vllm serve /models/gemma-4-31b-it-nvfp4 --quantization modelopt --tensor-parallel-size 8
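Once served, vLLM exposes an OpenAI-compatible chat-completions endpoint. A minimal client sketch, assuming the server listens on vLLM's default port 8000 and the served model name matches the path above (both are assumptions, not part of this card):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query(url: str, payload: dict) -> dict:
    """POST the payload to a running vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("/models/gemma-4-31b-it-nvfp4",
                             "Summarize NVFP4 in one sentence.")
# query("http://localhost:8000/v1/chat/completions", payload)  # needs a running server
print(payload["model"])
```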

Evaluation Results:

| Benchmark | Baseline (ours) | NVFP4 |
|---|---|---|
| GPQA Diamond | 75.71% | 75.46% |
| AIME 2025 | 66.25% | 65.94% |
| MMLU Pro | 85.25% | 84.94% |
| LiveCodeBench (pass@1) | 70.90% | 70.63% |
| Scicode subtask acc (pass@1) | 33.61% | 33.18% |
| Terminal-Bench Hard (pass@1) | 27.08% | 27.08% |

Model Limitations:

The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself contains nothing explicitly offensive.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated output will not blur or preserve the proportions of the subjects it contains.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Author: nvidia

Likes: 26

Downloads: 0

Tags: Model Optimizer, safetensors, gemma4, nvidia, ModelOpt, Gemma-4-31B-IT, lighthouse, quantized, NVFP4, text-generation, conversational, base_model:google/gemma-4-31B-it, base_model:quantized:google/gemma-4-31B-it, license:other, modelopt, region:us

FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF


license: apache-2.0 base_model:

  • FINAL-Bench/Darwin-35B-A3B-Opus tags:
  • llama-cpp
  • gguf
  • quantized
  • Q8_0
  • merge
  • evolutionary-merge
  • darwin
  • darwin-v5
  • reasoning
  • qwen3.5
  • qwen
  • moe
  • mixture-of-experts
  • claude-opus
  • distillation
  • multilingual
  • gpqa
  • open-source
  • apache-2.0
  • layer-wise-merge
  • coding-agent
  • tool-calling
  • long-context
  • 262k-context language:
  • en
  • zh
  • ko
  • ja
  • de
  • fr
  • es
  • ru
  • ar
  • multilingual pipeline_tag: text-generation library_name: gguf quantized_by: VIDRAFT model-index:
  • name: Darwin-35B-A3B-Opus-Q8_0-GGUF results:
    • task: type: text-generation name: Graduate-Level Reasoning dataset: type: Idavidrein/gpqa name: GPQA Diamond config: gpqa_diamond split: train metrics:
      • type: accuracy value: 90.0 name: Accuracy verified: false
    • task: type: text-generation name: Multilingual Knowledge dataset: type: openai/MMMLU name: MMMLU metrics:
      • type: accuracy value: 85.0 name: Accuracy verified: false

Darwin-35B-A3B-Opus-Q8_0-GGUF

<p align="center"> <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/Original-Darwin--35B--A3B--Opus-blue" alt="Original Model"></a> <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/FINAL_Bench-Leaderboard-green" alt="FINAL Bench"></a> <a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/ALL_Bench-Leaderboard-orange" alt="ALL Bench"></a> </p>

Q8_0 GGUF of Darwin-35B-A3B-Opus | ~37GB (3 shards) | GPQA Diamond 90.0% | Near-lossless quality | MoE 35B (3B active) | 201 Languages | 262K Context | Apache 2.0


About This Quantization

Q8_0 GGUF of FINAL-Bench/Darwin-35B-A3B-Opus.

| | Original (BF16) | This Model (Q8_0 GGUF) |
|---|---|---|
| Format | SafeTensors | GGUF |
| Size | 65.5 GB | ~37 GB (3 shards) |
| Quality | Baseline | Near-lossless (~99.9% of BF16) |
| VRAM Required | 65+ GB | ~37 GB |
| Runs on | H100, A100 80GB | A100 40GB, Mac 64GB, 2x RTX 4090 |
| Framework | Transformers, vLLM, SGLang | llama.cpp, Ollama, LM Studio |


Files

| File | Size | Description |
|---|---|---|
| darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf | ~13.6 GB | Shard 1 of 3 |
| darwin-35b-a3b-opus-q8_0-00002-of-00003.gguf | ~12.5 GB | Shard 2 of 3 |
| darwin-35b-a3b-opus-q8_0-00003-of-00003.gguf | ~10.7 GB | Shard 3 of 3 |
| Total | ~36.8 GB | All 3 shards required |

Download all 3 shard files. llama.cpp and Ollama will automatically load them together.
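Before loading, it can help to check that the full shard set is actually on disk. A small sketch, relying only on the `-NNNNN-of-NNNNN` naming convention shown above:

```python
import re

def missing_shards(filenames):
    """Return the shard indices implied by the '-NNNNN-of-NNNNN' naming
    convention that are absent from the given file list."""
    pattern = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")
    present, total = set(), None
    for name in filenames:
        m = pattern.search(name)
        if m:
            present.add(int(m.group(1)))
            total = int(m.group(2))
    if total is None:
        return []
    return sorted(set(range(1, total + 1)) - present)

files = [
    "darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf",
    "darwin-35b-a3b-opus-q8_0-00003-of-00003.gguf",
]
print(missing_shards(files))  # → [2]
```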


Hardware Requirements

| Setup | Memory | Status |
|---|---|---|
| NVIDIA A100 40GB | 40 GB VRAM | Fits |
| NVIDIA A100 80GB | 80 GB VRAM | Comfortable |
| NVIDIA H100 93GB | 93 GB VRAM | Comfortable |
| 2x RTX 4090 (24GB each) | 48 GB VRAM | With tensor parallel |
| Mac Studio M2/M3 Ultra 64GB | 64 GB Unified | Fits |
| Mac M3 Max 48GB | 48 GB Unified | Fits |
| Single RTX 4090 24GB | 24 GB VRAM | Insufficient (use Q4_K_M) |

As a MoE model, only 3B parameters are active per token. Inference is fast despite the 37GB model size.


Usage

llama.cpp (CLI)

llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -p "The meaning to life and the universe is" \
  -n 512 -ngl 99

llama.cpp (Server)

llama-server \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -c 32768 -ngl 99

Ollama

echo 'FROM ./darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf' > Modelfile
ollama create darwin-opus -f Modelfile
ollama run darwin-opus

LM Studio

  1. Download all 3 .gguf shard files
  2. Place them in the same folder
  3. Open LM Studio, load the first shard
  4. LM Studio auto-detects and loads all shards

MoE Expert Offload (Limited VRAM)

llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  -ngl 99 -c 32768

Benchmark Results (Original Model)

Q8_0 preserves near-identical performance to BF16.

GPQA Diamond (198 Questions, Graduate-Level Reasoning)

| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 90.0% |
| Mother (Jackrong Claude 4.6 Opus Distilled) | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 84.2% |

MMMLU (Multilingual Knowledge, 29 Languages)

| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 85.2% |


How Darwin Was Created

Darwin-35B-A3B-Opus was created using Darwin V5, a diagnostic-guided evolutionary merge engine built on mergekit.

Both parent models share the same Qwen3.5-35B-A3B architecture; the Mother is a LoRA SFT on the same base, not a different architecture.

Darwin V5 adds three phases over standard mergekit evolve:

  1. Pre-merge parent profiling (40 layers x 256 experts: activation frequency, routing entropy, probe cosine distance)
  2. Evolution with diagnostic-informed initial population and constrained search space
  3. Post-merge child validation (layer-by-layer comparison against both parents)

Key diagnostic finding: Mother had 50-65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin compensated by reducing Mother density and using Father's living experts to fill inactive slots.
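The dead-expert diagnostic above amounts to a threshold over per-expert activation frequencies. A sketch (the 5% cutoff comes from the text; the sample frequencies are invented for illustration):

```python
def dead_expert_fraction(activation_freqs, threshold=0.05):
    """Fraction of experts whose activation frequency falls below the cutoff."""
    dead = sum(1 for f in activation_freqs if f < threshold)
    return dead / len(activation_freqs)

# Hypothetical per-expert activation frequencies for one layer (256 experts):
# 150 rarely-activated experts, 106 healthy ones.
freqs = [0.01] * 150 + [0.20] * 106
print(f"{dead_expert_fraction(freqs):.2%}")  # within the reported 50-65% range
```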

Merge configuration:

# Method: DARE-TIES via mergekit
L0-L37:  t=0.5988 (Mother 60%) - router from Mother
L38:     t=0.9000 (Mother 90%) - reasoning core (peak probe cosine distance)
L39:     t=0.5336 (Father 47%) - router from Father (output routing)

For full technical details, diagnostics, and health check results, see the original model card.
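The layer-wise t schedule above can be read as a per-layer blending coefficient between the two parents. A highly simplified sketch: real DARE-TIES also drops and rescales delta parameters and resolves sign conflicts, so this only illustrates how a per-layer t mixes two parent tensors:

```python
def blend_layer(mother_w, father_w, t):
    """Linear per-layer blend: t is the Mother's share, 1 - t the Father's.
    Actual DARE-TIES additionally applies random delta dropping with
    rescaling and sign election; this shows only the t schedule."""
    return [t * m + (1 - t) * f for m, f in zip(mother_w, father_w)]

# Layer-wise t schedule from the merge configuration above (40 layers).
schedule = {range(0, 38): 0.5988, range(38, 39): 0.9000, range(39, 40): 0.5336}

def t_for_layer(layer):
    for layers, t in schedule.items():
        if layer in layers:
            return t
    raise ValueError(f"no t for layer {layer}")

print(t_for_layer(38))  # → 0.9
```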


Other Quantizations

| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q8_0 (this) | ~37 GB | Near-lossless | Maximum quality |
| Q4_K_M (coming soon) | ~20 GB | Good | RTX 4090, Mac 32GB |


Model Specifications

| | |
|---|---|
| Base Model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Experts | 256 (8 routed + 1 shared active) |
| Context Length | 262,144 native |
| Languages | 201 |
| Quantization | Q8_0 (8-bit integer) |
| GGUF Shards | 3 files |
| License | Apache 2.0 |
| Quantized by | VIDRAFT via llama.cpp |


Acknowledgements

  • Korean Government – GPU Support Program research grant
  • Qwen Team – Qwen3.5-35B-A3B base architecture
  • Jackrong – Claude 4.6 Opus Reasoning Distilled model
  • mergekit – Merge backend infrastructure
  • llama.cpp – GGUF conversion and quantization

Citation

@misc{vidraft_darwin_35b_opus_gguf,
  title        = {Darwin-35B-A3B-Opus-Q8_0-GGUF},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}}
}

Author: FINAL-Bench

Likes: 12

Downloads: 0

Tags: gguf, qwen3_5_moe, llama-cpp, quantized, Q8_0, merge, evolutionary-merge, darwin, darwin-v5, reasoning, qwen3.5, qwen, moe, mixture-of-experts, claude-opus, distillation, multilingual, gpqa, open-source, apache-2.0, layer-wise-merge, coding-agent, tool-calling, long-context, 262k-context, text-generation, en, zh, ko, ja, de, fr, es, ru, ar, base_model:FINAL-Bench/Darwin-35B-A3B-Opus, base_model:quantized:FINAL-Bench/Darwin-35B-A3B-Opus, license:apache-2.0, model-index, endpoints_compatible, region:us, conversational

YTan2000/Qwopus3.5-27B-v3-TQ3_4S


license: apache-2.0 language:

  • en pipeline_tag: text-generation library_name: gguf tags:
  • gguf
  • llama.cpp
  • qwen
  • qwopus
  • quantization
  • turboquant
  • tq3_4s base_model:
  • Jackrong/Qwopus3.5-27B-v3
  • Qwen/Qwen3.5-27B base_model_relation: quantized model-index:
  • name: Qwopus3.5-27B-v3-TQ3_4S results: [] thumbnail: https://huggingface.co/YTan2000/Qwopus3.5-27B-v3-TQ3_4S/resolve/main/thumbnail.png

Qwopus3.5-27B-v3-TQ3_4S

TQ3_4S is a 3.5-bit Walsh-Hadamard-transform weight format with four scales (one per 8 weights) in each 32-weight block.

This release is a TQ3_4S GGUF quantization of Jackrong/Qwopus3.5-27B-v3, which is itself derived from the Qwen3.5-27B family.

Quantization Source

  • HF source checkout:
    • Jackrong/Qwopus3.5-27B-v3
  • upstream family:
    • Qwen/Qwen3.5-27B
  • F16 GGUF used as the quantization source:
    • Qwopus3.5-27B-v3-f16.gguf

Quantized with:

./build/bin/llama-quantize \
  /path/to/Qwopus3.5-27B-v3-f16.gguf \
  /path/to/Qwopus3.5-27B-v3-TQ3_4S.gguf \
  TQ3_4S \
  8

Quality

Full-pass wiki.test.raw, c=2048:

  • Final PPL = 6.3433 +/- 0.03999
  • Median chunk PPL = 6.1953
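Final and median chunk PPL can differ because the final perplexity is computed from the mean negative log-likelihood over all chunks rather than from per-chunk perplexities. A sketch with invented per-chunk NLLs (the values are illustrative, not from this evaluation):

```python
import math
import statistics

def final_ppl(chunk_nlls):
    """Perplexity from the mean per-token negative log-likelihood."""
    return math.exp(statistics.fmean(chunk_nlls))

def median_chunk_ppl(chunk_nlls):
    """Median of the per-chunk perplexities."""
    return statistics.median(math.exp(n) for n in chunk_nlls)

# Hypothetical mean NLL per 2048-token chunk; one hard chunk skews the mean,
# pulling the final PPL above the median chunk PPL, as seen above.
nlls = [1.80, 1.82, 1.85, 1.90, 2.10]
print(round(final_ppl(nlls), 4), round(median_chunk_ppl(nlls), 4))
```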

Runtime Validation

Validated on clean public llama.cpp-tq3 main:

  • runtime commit: 62eb27dce
  • strict chat smoke:
    • prompt: Write ONLY the word ok.
    • response: ok

Validated server profile:

./build/bin/llama-server \
  -m /path/to/Qwopus3.5-27B-v3-TQ3_4S.gguf \
  -a qwopus35-27b-v3-tq3_4s \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 4096 -np 1 \
  -ctk q8_0 -ctv q8_0 -fa on \
  --no-warmup --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek

Notes

  • This is a weight quantization release for the Qwopus v3 model line.
  • The TQ3_4S runtime is provided by:
    • turbo-tan/llama.cpp-tq3


Author: YTan2000

Likes: 9

Downloads: 0

Tags: gguf, llama.cpp, qwen, qwopus, quantization, turboquant, tq3_4s, text-generation, en, base_model:Jackrong/Qwopus3.5-27B-v3, base_model:quantized:Jackrong/Qwopus3.5-27B-v3, license:apache-2.0, endpoints_compatible, region:us, conversational

mlx-community/gemma-4-26b-a4b-it-4bit


library_name: mlx license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license pipeline_tag: image-text-to-text tags:

  • mlx base_model: google/gemma-4-26b-a4b-it

mlx-community/gemma-4-26b-a4b-it-4bit

This model was converted to MLX format from google/gemma-4-26b-a4b-it using mlx-vlm version 0.4.3. Refer to the original model card for more details on the model.

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model mlx-community/gemma-4-26b-a4b-it-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>

Author: mlx-community

Likes: 8

Downloads: 0

Tags: mlx, safetensors, gemma4, image-text-to-text, conversational, license:apache-2.0, 4-bit, region:us

mlx-community/gemma-4-31b-it-4bit


library_name: mlx license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license pipeline_tag: image-text-to-text tags:

  • mlx base_model: google/gemma-4-31b-it

mlx-community/gemma-4-31b-it-4bit

This model was converted to MLX format from google/gemma-4-31b-it using mlx-vlm version 0.4.3. Refer to the original model card for more details on the model.

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model mlx-community/gemma-4-31b-it-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>

Author: mlx-community

Likes: 8

Downloads: 0

Tags: mlx, safetensors, gemma4, image-text-to-text, conversational, license:apache-2.0, 4-bit, region:us

p-e-w/gemma-4-E2B-it-heretic-ara


library_name: transformers license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license pipeline_tag: any-to-any tags:

  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara

This is a decensored version of google/gemma-4-E2B-it, made using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method (with row-norm preservation).

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| start_layer_index | 16 |
| end_layer_index | 32 |
| preserve_good_behavior_weight | 0.1887 |
| steer_bad_behavior_weight | 0.0001 |
| overcorrect_relative_weight | 0.6737 |
| neighbor_count | 4 |

Performance

| Metric | This model | Original model (google/gemma-4-E2B-it) |
| :----- | :--------: | :---------------------------: |
| KL divergence | 0.1522 | 0 (by definition) |
| Refusals | 5/100 | 98/100 |
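The KL divergence metric measures how far the decensored model's next-token distribution drifts from the original's. A minimal sketch over a toy three-token vocabulary (the distributions are invented for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats over two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: original model vs. ablated model.
p = [0.70, 0.20, 0.10]
q = [0.60, 0.25, 0.15]
print(round(kl_divergence(p, q), 4))  # → 0.0227
```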


<div align="center"> <img src=https://ai.google.dev/gemma/images/gemma4_banner.png> </div> <p align="center"> <a href="https://huggingface.co/collections/google/gemma-4" target="_blank">Hugging Face</a> | <a href="https://github.com/google-gemma" target="_blank">GitHub</a> | <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank">Launch Blog</a> | <a href="https://ai.google.dev/gemma/docs/core" target="_blank">Documentation</a> <br> <b>License</b>: <a href="https://ai.google.dev/gemma/docs/gemma_4_license" target="_blank">Apache 2.0</a> | <b>Authors</b>: <a href="https://deepmind.google/models/gemma/" target="_blank">Google DeepMind</a> </p>

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

  • Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

  • Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).

  • Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.

  • Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.

  • Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.

  • Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.

  • Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
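The difference between the two layer types comes down to their attention masks. A sketch of the boolean masks each layer type uses (sequence length and window size are illustrative, not the model's actual values):

```python
def causal_mask(seq_len, window=None):
    """Boolean attention mask: entry [i][j] says query i may attend to key j.
    window=None gives full global causal attention; an integer restricts
    each query to the last `window` keys (sliding-window attention)."""
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

local = causal_mask(6, window=2)   # sliding-window layer
global_ = causal_mask(6)           # global layer (the final layer is always global)
print(sum(map(sum, local)), sum(map(sum, global_)))  # → 11 21
```

The count gap illustrates the memory trade-off: sliding-window layers keep the attended-key count bounded by the window, while global layers grow linearly with context length.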

Dense Models

| Property | E2B | E4B | 31B Dense |
| :---- | :---- | :---- | :---- |
| Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 30.7B |
| Layers | 35 | 42 | 60 |
| Sliding Window | 512 tokens | 512 tokens | 1024 tokens |
| Context Length | 128K tokens | 128K tokens | 256K tokens |
| Vocabulary Size | 262K | 262K | 262K |
| Supported Modalities | Text, Image, Audio | Text, Image, Audio | Text, Image |
| Vision Encoder Parameters | ~150M | ~150M | ~550M |
| Audio Encoder Parameters | ~300M | ~300M | No Audio |

The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.

Mixture-of-Experts (MoE) Model

| Property | 26B A4B MoE |
| :---- | :---- |
| Total Parameters | 25.2B |
| Active Parameters | 3.8B |
| Layers | 30 |
| Sliding Window | 1024 tokens |
| Context Length | 256K tokens |
| Vocabulary Size | 262K |
| Expert Count | 8 active / 128 total and 1 shared |
| Supported Modalities | Text, Image |
| Vision Encoder Parameters | ~550M |

The "A" in 26B A4B stands for "active parameters" in contrast to the total number of parameters the model contains. By only activating a 4B subset of parameters during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model.
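The active-parameter arithmetic can be sketched as follows. Only the 8-routed-plus-1-shared expert structure comes from the table above; the per-expert and non-expert sizes below are hypothetical values chosen so the totals roughly echo the 25.2B total / 3.8B active figures:

```python
def active_params(total_experts, routed_per_token, expert_size,
                  shared_expert_size, non_expert_size):
    """Parameters touched per token in a MoE model: every token uses the
    shared expert, its routed experts, and all non-expert weights."""
    total = total_experts * expert_size + shared_expert_size + non_expert_size
    active = routed_per_token * expert_size + shared_expert_size + non_expert_size
    return total, active

total, active = active_params(
    total_experts=128, routed_per_token=8,
    expert_size=170e6, shared_expert_size=170e6, non_expert_size=2.1e9,
)
print(f"{total / 1e9:.1f}B total, {active / 1e9:.2f}B active")
```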

Benchmark Results

These models were evaluated against a large collection of datasets and metrics covering different aspects of text generation. The results in the table are for instruction-tuned models.

| | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
| :---- | :---- | :---- | :---- | :---- | :---- |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 no tools | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | - | - | - |
| HLE with search | 26.5% | 17.2% | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
| Vision | | | | | |
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5% | - |
| Audio | | | | | |
| CoVoST | - | - | 35.54 | 33.47 | - |
| FLEURS (lower is better) | - | - | 0.08 | 0.09 | - |
| Long Context | | | | | |
| MRCR v2 8 needle 128k (average) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |

Core Capabilities

Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:

  • Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
  • Long Context – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
  • Image Understanding – Object detection, document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
  • Video Understanding – Analyze video by processing sequences of frames.
  • Interleaved Multimodal Input – Freely mix text and images in any order within a single prompt.
  • Function Calling – Native support for structured tool use, enabling agentic workflows.
  • Coding – Code generation, completion, and correction.
  • Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
  • Audio (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

Getting Started

You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:

pip install -U transformers torch accelerate

Once you have everything installed, you can proceed to load the model with the code below:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

Once the model is loaded, you can start generating output:

# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True, 
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)

To enable reasoning, set enable_thinking=True and the parse_response function will take care of parsing the thinking output.

Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:

<details> <summary>Code for processing Audio</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process audio. To use it, make sure to install the following packages:

pip install -U transformers torch torchvision librosa accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:

# Prompt - add audio before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details> <details> <summary>Code for processing Images</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process images. To use it, make sure to install the following packages:

pip install -U transformers torch torchvision accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:

# Prompt - add image before text
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details> <details> <summary>Code for processing Videos</summary>

Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process videos. To use it, make sure to install the following packages:

pip install -U transformers torch torchvision librosa accelerate

You can then load the model with the code below:

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, 
    dtype="auto", 
    device_map="auto"
)

Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:

# Prompt - add video before text
messages = [
    {
        'role': 'user',
        'content': [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {'type': 'text', 'text': 'Describe this video.'}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
</details>

Best Practices

For the best performance, use these configurations and best practices:

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
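As a minimal sketch, these values can be collected into a single kwargs dict and passed to `generate` alongside the inputs shown above (setting `do_sample=True` is an assumption about the Transformers API, since sampling parameters only take effect when sampling is enabled):

```python
# Recommended sampling configuration from this card.
# do_sample=True is needed for temperature/top_p/top_k to apply in Transformers.
GENERATION_CONFIG = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

# Usage sketch (model and inputs loaded as shown earlier):
# outputs = model.generate(**inputs, max_new_tokens=512, **GENERATION_CONFIG)
```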

2. Thinking Mode Configuration

Unlike Gemma 3, these models use the standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:

  • Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
  • Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
    <|channel>thought\n[Internal reasoning]<channel|>
  • Disabled Thinking Behavior: For all models except the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
    <|channel>thought\n<channel|>[Final answer]

> [!NOTE]
> Many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.
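When handling raw output yourself, the thought segment can be separated from the final answer with a small parser. This is only an illustrative sketch: the `<|channel>thought ... <channel|>` token strings are taken from this card's description and may differ in your chat template.

```python
import re

# Matches the thought block format described above:
# <|channel>thought\n[Internal reasoning]<channel|>[Final answer]
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)

def split_thought(response: str) -> tuple[str, str]:
    """Return (thought, final_answer); thought is empty if thinking was disabled."""
    match = THOUGHT_RE.search(response)
    if match is None:
        return "", response.strip()
    thought = match.group(1).strip()
    answer = response[match.end():].strip()  # everything after the closing tag
    return thought, answer
```

With thinking disabled, the empty thought block parses to an empty string and the final answer is returned unchanged.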

3. Multi-Turn Conversations

  • No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
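A hedged sketch of this rule: when appending a completed turn to the history, store only the final answer. Here `final_answer` is assumed to already have any thought segment removed (the helper and message shape are illustrative, mirroring the chat format used in the examples above).

```python
# Keep multi-turn history clean: append the user message and only the model's
# final answer, never its thought block, before the next user turn begins.
def append_turn(messages: list[dict], user_text: str, final_answer: str) -> list[dict]:
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": final_answer})
    return messages

history = [{"role": "system", "content": "You are a helpful assistant."}]
append_turn(history, "Hi!", "Hello! How can I help?")
```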

4. Modality Order

  • For optimal performance with multimodal inputs, place image and/or audio content before the text in your prompt.

5. Variable Image Resolution

Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

  • The supported token budgets are: 70, 140, 280, 560, and 1120.
    • Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
    • Use higher budgets for tasks like OCR, document parsing, or reading small text.
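The guidance above can be sketched as a small lookup helper. Note that the budget values come from this card, but the task-to-budget mapping and the helper itself are hypothetical illustrations (how the budget is actually passed to the processor is not shown here):

```python
# Supported visual token budgets, per this card.
SUPPORTED_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)

# Illustrative starting points only: low budgets for coarse tasks,
# high budgets where fine-grained detail matters.
TASK_BUDGET_HINTS = {
    "classification": 70,
    "captioning": 140,
    "video": 140,        # many frames -> keep per-frame cost low
    "ocr": 1120,
    "document": 1120,    # document parsing / reading small text
}

def pick_token_budget(task: str, default: int = 280) -> int:
    budget = TASK_BUDGET_HINTS.get(task, default)
    assert budget in SUPPORTED_TOKEN_BUDGETS, f"unsupported budget: {budget}"
    return budget
```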

6. Audio

Use the following prompt structures for audio processing:

  • Automatic Speech Recognition (ASR)
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
  • Automatic Speech Translation (AST)
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
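The templates above can be filled programmatically. The template text is copied from this card; the wrapper functions are illustrative:

```python
# ASR / AST prompt templates, text taken from this card.
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it "
    "into {target}.\nWhen formatting the answer, first output the transcription "
    "in {source}, then one newline, then output the string '{target}: ', then "
    "the translation in {target}."
)

def asr_prompt(language: str) -> str:
    return ASR_TEMPLATE.format(language=language)

def ast_prompt(source: str, target: str) -> str:
    return AST_TEMPLATE.format(source=source, target=target)
```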

7. Audio and Video Length

All models support image inputs and can process videos as sequences of frames; the E2B and E4B models additionally support audio inputs. Audio clips can be up to 30 seconds long, and videos up to 60 seconds, assuming frames are sampled at one frame per second.
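A minimal sketch of validating inputs against these limits before building a prompt (the helper is illustrative; the limits come from this card):

```python
# Limits per this card: 30 s of audio; 60 s of video at 1 frame per second.
MAX_AUDIO_SECONDS = 30
MAX_VIDEO_SECONDS = 60
FRAMES_PER_SECOND = 1

def check_clip(kind: str, duration_seconds: float) -> int:
    """Return the video frame count (0 for audio); raise if the clip is too long."""
    if kind == "audio":
        if duration_seconds > MAX_AUDIO_SECONDS:
            raise ValueError(f"audio clip {duration_seconds}s exceeds {MAX_AUDIO_SECONDS}s")
        return 0
    if kind == "video":
        if duration_seconds > MAX_VIDEO_SECONDS:
            raise ValueError(f"video clip {duration_seconds}s exceeds {MAX_VIDEO_SECONDS}s")
        return int(duration_seconds * FRAMES_PER_SECOND)
    raise ValueError(f"unknown kind: {kind}")
```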

Model Data

Data used for model training and how the data was processed.

Training Dataset

Our pre-training dataset is a large-scale, diverse collection spanning a wide range of domains and modalities, including web documents, code, mathematics, images, and audio, with a knowledge cutoff of January 2025. Here are the key components:

  • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
  • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
  • Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
  • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

  • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
  • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
  • Additional methods: Filtering based on content quality and safety in line with our policies.

Ethics and Safety

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.

Evaluation Approach

Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google's AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:

  • Content related to child sexual abuse material and exploitation
  • Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
  • Sexually explicit content
  • Hate speech (e.g., dehumanizing members of protected groups)
  • Harassment (e.g., encouraging violence against people)

Evaluation Results

Across all areas of safety testing, we saw major improvements in every category of content safety relative to previous Gemma models. Gemma 4 significantly outperforms the Gemma 3 and 3n models on safety while keeping unjustified refusals low. All testing was conducted without safety filters to evaluate the models' capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the models produced minimal policy violations, a significant improvement over previous Gemma models.

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Content Creation and Communication
    • Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
    • Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
    • Audio Processing and Interaction: The smaller models (E2B and E4B) can analyze and interpret audio inputs, enabling voice-driven interactions and transcriptions.
  • Research and Education
    • Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data
    • The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity
    • Models perform well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance
    • Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy
    • Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  • Common Sense
    • Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

Ethical Considerations and Risks

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

  • Bias and Fairness
    • VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. Gemma 4 models underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
  • Misinformation and Misuse
    • VLMs can be misused to generate text that is false, misleading, or harmful.
    • Guidelines are provided for responsible use with the model, see the Responsible Generative AI Toolkit.
  • Transparency and Accountability
    • This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
    • A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

  • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
  • Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
  • Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
  • Perpetuation of biases: Developers are encouraged to perform continuous monitoring (using evaluation metrics and human review) and to explore de-biasing techniques during model training, fine-tuning, and other use cases.

Benefits

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development, offering strong performance relative to similarly sized models.

Author: p-e-w

Likes: 7

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, heretic, uncensored, decensored, abliterated, ara, any-to-any, license:apache-2.0, endpoints_compatible, region:us

bartowski/google_gemma-4-31B-it-GGUF


quantized_by: bartowski
pipeline_tag: image-text-to-text
base_model_relation: quantized
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
license: apache-2.0
base_model: google/gemma-4-31B-it

Warning: Something seems wrong with conversion and is being investigated, will update when we know more (this is a problem with llama.cpp and should affect all Gemma 4 models)

Don't download if you're limited on bandwidth, wait for fixes in the coming (hopefully) hours

Llamacpp imatrix Quantizations of gemma-4-31B-it by google

Using <a href="https://github.com/ggml-org/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggml-org/llama.cpp/releases/tag/b8637">b8637</a> for quantization.

Original model: https://huggingface.co/google/gemma-4-31B-it

All quants made using imatrix option with dataset from here

Run them in your choice of tools:

Note: if it's a newly supported model, you may need to wait for an update from the developers.

Prompt format

<bos><|turn>system
{system_prompt}<turn|>
<|turn>user
{prompt}<turn|>
<|turn>model
<|channel>thought
<channel|>

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| gemma-4-31B-it-bf16.gguf | bf16 | 61.41GB | true | Full BF16 weights. |
| gemma-4-31B-it-Q8_0.gguf | Q8_0 | 32.64GB | false | Extremely high quality, generally unneeded but max available quant. |
| gemma-4-31B-it-Q6_K_L.gguf | Q6_K_L | 27.07GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| gemma-4-31B-it-Q6_K.gguf | Q6_K | 26.73GB | false | Very high quality, near perfect, recommended. |
| gemma-4-31B-it-Q5_K_L.gguf | Q5_K_L | 22.95GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| gemma-4-31B-it-Q5_K_M.gguf | Q5_K_M | 22.61GB | false | High quality, recommended. |
| gemma-4-31B-it-Q5_K_S.gguf | Q5_K_S | 21.50GB | false | High quality, recommended. |
| gemma-4-31B-it-Q4_K_L.gguf | Q4_K_L | 19.94GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| gemma-4-31B-it-Q4_1.gguf | Q4_1 | 19.77GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| gemma-4-31B-it-Q4_K_M.gguf | Q4_K_M | 19.60GB | false | Good quality, default size for most use cases, recommended. |
| gemma-4-31B-it-Q4_K_S.gguf | Q4_K_S | 18.20GB | false | Slightly lower quality with more space savings, recommended. |
| gemma-4-31B-it-Q4_0.gguf | Q4_0 | 18.08GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| gemma-4-31B-it-IQ4_NL.gguf | IQ4_NL | 18.03GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| gemma-4-31B-it-IQ4_XS.gguf | IQ4_XS | 17.16GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| gemma-4-31B-it-Q3_K_XL.gguf | Q3_K_XL | 17.15GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| gemma-4-31B-it-Q3_K_L.gguf | Q3_K_L | 16.81GB | false | Lower quality but usable, good for low RAM availability. |
| gemma-4-31B-it-Q3_K_M.gguf | Q3_K_M | 15.92GB | false | Low quality. |
| gemma-4-31B-it-IQ3_M.gguf | IQ3_M | 15.13GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| gemma-4-31B-it-Q3_K_S.gguf | Q3_K_S | 14.33GB | false | Low quality, not recommended. |
| gemma-4-31B-it-IQ3_XS.gguf | IQ3_XS | 13.84GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| gemma-4-31B-it-IQ3_XXS.gguf | IQ3_XXS | 12.98GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| gemma-4-31B-it-Q2_K_L.gguf | Q2_K_L | 12.97GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| gemma-4-31B-it-IQ2_M.gguf | IQ2_M | 12.65GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| gemma-4-31B-it-Q2_K.gguf | Q2_K | 12.63GB | false | Very low quality but surprisingly usable. |
| gemma-4-31B-it-IQ2_S.gguf | IQ2_S | 12.08GB | false | Low quality, uses SOTA techniques to be usable. |
| gemma-4-31B-it-IQ2_XS.gguf | IQ2_XS | 11.50GB | false | Low quality, uses SOTA techniques to be usable. |
| gemma-4-31B-it-IQ2_XXS.gguf | IQ2_XXS | 10.83GB | false | Very low quality, uses SOTA techniques to be usable. |
| gemma-4-31B-it-IQ1_M.gguf | IQ1_M | 10.11GB | false | Extremely low quality, not recommended. |

Embed/output weights

Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.

Downloading using huggingface-cli

<details> <summary>Click to view download instructions</summary>

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download bartowski/google_gemma-4-31B-it-GGUF --include "google_gemma-4-31B-it-Q4_K_M.gguf" --local-dir ./

If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:

huggingface-cli download bartowski/google_gemma-4-31B-it-GGUF --include "google_gemma-4-31B-it-Q8_0/*" --local-dir ./

You can either specify a new local-dir (google_gemma-4-31B-it-Q8_0) or download them all in place (./)

</details>

ARM/AVX information

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass.

Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly.

As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0.

Additionally, if you want to get slightly better quality for , you can use IQ4_NL thanks to this PR which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase.

<details> <summary>Click to view Q4_0_X_X information (deprecated)</summary>

I'm keeping this section to show the potential theoretical uplift in performance from using the Q4_0 with online repacking.

<details> <summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ----------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation

</details> </details>

Which file should I choose?

<details> <summary>Click here for details</summary>

A great write up with charts showing various performances is provided by Artefact2 here

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
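This sizing rule can be sketched as a small helper. The quant sizes below are a subset copied from the table above; the helper itself is just an illustration of the "largest file that fits with 1-2GB of headroom" heuristic:

```python
# Sizes in GB, from the download table above (subset for brevity).
QUANT_SIZES_GB = {
    "Q8_0": 32.64, "Q6_K": 26.73, "Q5_K_M": 22.61,
    "Q4_K_M": 19.60, "IQ4_XS": 17.16, "Q3_K_M": 15.92, "IQ2_M": 12.65,
}

def pick_quant(available_gb: float, headroom_gb: float = 1.5):
    """Pick the largest quant whose file size fits in available memory minus headroom."""
    budget = available_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None  # nothing fits; consider a smaller quant or more offloading
    return max(fitting, key=fitting.get)
```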

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

llama.cpp feature matrix

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

</details>

Credits

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.

Thank you ZeroWw for the inspiration to experiment with embed/output.

Thank you to LM Studio for sponsoring my work.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

Author: bartowski

Likes: 6

Downloads: 0

Tags: gguf, image-text-to-text, base_model:google/gemma-4-31B-it, base_model:quantized:google/gemma-4-31B-it, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

Jackrong/MLX-Qwopus3.5-27B-v3-4bit


language:

  • en
  • zh
  • ko

license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3

tags:

  • unsloth
  • qwen
  • qwen3.5
  • reasoning
  • chain-of-thought
  • lora
  • competitive-programming
  • mlx

pipeline_tag: text-generation
library_name: mlx

Jackrong/MLX-Qwopus3.5-27B-v3-4bit

This model Jackrong/MLX-Qwopus3.5-27B-v3-4bit was converted to MLX format from Jackrong/Qwopus3.5-27B-v3 using mlx-lm version 0.30.7.

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("Jackrong/MLX-Qwopus3.5-27B-v3-4bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: Jackrong

Likes: 6

Downloads: 0

Tags: mlx, safetensors, qwen3_5, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, lora, competitive-programming, text-generation, conversational, en, zh, ko, base_model:Jackrong/Qwopus3.5-27B-v3, base_model:adapter:Jackrong/Qwopus3.5-27B-v3, license:apache-2.0, 4-bit, region:us

mlx-community/gemma-4-26b-a4b-4bit


library_name: mlx
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
tags:

  • mlx

base_model: google/gemma-4-26b-a4b

mlx-community/gemma-4-26b-a4b-4bit

This model was converted to MLX format from google/gemma-4-26b-a4b using mlx-vlm version 0.4.3. Refer to the original model card for more details on the model.

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model mlx-community/gemma-4-26b-a4b-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>

Author: mlx-community

Likes: 5

Downloads: 0

Tags: mlx, safetensors, gemma4, image-text-to-text, license:apache-2.0, 4-bit, region:us