Today's AI Summary

AI Insights: Spatial Intelligence, Granular Segmentation, and On-the-Fly Agent Evolution

Here's a look at some of the most interesting developments in AI from today, covering advancements in spatial reasoning, image segmentation, and self-improving software agents.

Research Highlights

Several compelling research papers have emerged, pushing the boundaries of AI capabilities:

  • Scaling Spatial Intelligence: The paper "Scaling Spatial Intelligence with Multimodal Foundation Models" introduces SenseNova-SI, a family of models designed to enhance spatial reasoning. By training on a meticulously curated dataset of 8 million samples, SenseNova-SI achieves state-of-the-art performance across multiple spatial intelligence benchmarks, while maintaining strong general multimodal understanding. The research also explores the impact of data scaling and emergent generalization capabilities.
  • Segment Anything at Any Granularity: "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity" presents a novel approach to image segmentation. UnSAMv2 extends the capabilities of the Segment Anything Model (SAM) by enabling precise, continuous control over segmentation scale without requiring human annotations. By using a divide-and-conquer strategy and a novel granularity control embedding, UnSAMv2 significantly enhances SAM-2's performance across various segmentation tasks.
  • Explainable AI for Extreme Events: The paper "From Black Box to Insight: Explainable AI for Extreme Event Preparedness" explores the use of explainable AI (XAI) in predicting extreme events like wildfires. By using SHapley Additive exPlanations (SHAP), the researchers uncover key features, decision pathways, and potential biases in AI models, enhancing trust and enabling more effective decision-making by domain experts and response teams.
  • Dexterous Robotic Hands: "From Power to Precision: Learning Fine-grained Dexterity for Multi-fingered Robotic Hands" introduces a method for jointly optimizing the control and hardware design of multi-fingered hands. By introducing a lightweight fingertip geometry modification and optimizing its parameters along with the corresponding control, the method enables both power and precision manipulation.
  • Clinical Foundation Models: "Generalist Foundation Models Are Not Clinical Enough for Hospital Operations" introduces Lang1, a family of models pretrained on a specialized corpus of clinical data. The models are evaluated on the REalistic Medical Evaluation (ReMedE) benchmark, which covers five critical hospital-operations tasks. Lang1 outperforms generalist models on these tasks, showing that specialized LLMs can hold their own against generalist models where domain knowledge matters.
  • Self-Evolving Software Agents: "Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?" introduces Live-SWE-agent, a software agent that autonomously and continuously evolves itself at runtime while solving real-world software problems. Starting from a basic scaffold, the agent rewrites its own implementation as it works, achieving impressive solve rates on the SWE-bench Verified and SWE-Bench Pro benchmarks.
  • Interpretable Circuits in Sparse Transformers: "Weight-sparse transformers have interpretable circuits" explores how to train models with more understandable circuits by constraining most of their weights to be zeros. By pruning the models to isolate the part responsible for the task, the researchers recover fine-grained circuits that often contain neurons and residual channels that correspond to natural concepts.

Model Spotlight

  • Supertone's Supertonic: This text-to-speech (TTS) system stands out for its speed and on-device capabilities. With 26 likes, Supertonic is designed for extreme performance with minimal computational overhead. It runs entirely on device using ONNX Runtime, preserving privacy and avoiding network latency. Performance benchmarks show it generating speech significantly faster than real time, outpacing other TTS systems and cloud-based APIs.
  • MiroThinker-v1.0-72B-AWQ-4bit: This agent model focuses on tool-augmented reasoning and information-seeking. MiroThinker introduces interactive scaling at the model level, training the model to handle deeper and more frequent agent–environment interactions. It supports a 256K context window and can handle up to 600 tool calls per task.
  • Foundation-Sec-1.1-8B-Instruct: This model is specialized for cybersecurity applications. It extends the Foundation-Sec-1.1-8B base model with instruction-following capabilities and an extended 64k context window. It is designed for security practitioners, researchers, and developers building AI-powered security workflows and applications.

Key Takeaways

  • Specialization Matters: The research on Lang1 highlights the importance of specialized models for specific domains, demonstrating that they can outperform generalist models in tasks requiring specialized knowledge.
  • Interactive Scaling: MiroThinker's interactive scaling, which trains the model to handle deeper and more frequent agent–environment interactions, points to interaction depth as another axis along which agent capabilities can scale.

AI Papers for 2026-02-21

Sink-Aware Pruning for Diffusion Language Models

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention-sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
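
For readers who want the core measurement spelled out, here is a minimal sketch (not the authors' code) of how one might quantify sink-position drift across denoising timesteps: track which key position receives the most attention mass at each timestep and take the variance of that index. The tensor shapes and toy data are assumptions.

import torch

def sink_position_variance(attn_maps):
    """attn_maps: per-timestep attention tensors of shape [heads, query_len, key_len].
    Returns the variance of the dominant sink position across timesteps, where the
    "sink" is the key index that receives the most total attention mass."""
    sink_positions = []
    for attn in attn_maps:
        mass_per_key = attn.sum(dim=(0, 1))          # total mass landing on each key
        sink_positions.append(mass_per_key.argmax().item())
    positions = torch.tensor(sink_positions, dtype=torch.float32)
    return positions.var(unbiased=False).item()

# Toy usage: random maps stand in for a DLM's attention over 16 denoising steps.
maps = [torch.softmax(torch.randn(8, 32, 32), dim=-1) for _ in range(16)]
print(sink_position_variance(maps))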

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

HIPE-2026 is a CLEF evaluation lab dedicated to person–place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person–place associations in multiple languages and time periods. Systems are asked to classify relations of two types, at ("Has the person ever been at this place?") and isAt ("Is the person located at this place around publication time?"), requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.

MARS: Margin-Aware Reward-Modeling with Self-Refinement

Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the ambiguous cases and failure modes of the reward model. MARS concentrates augmentation on low-margin (ambiguous) preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing informativeness and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
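
As a minimal sketch of the margin-aware idea (the reward-model interface and keep fraction are assumptions, not from the paper), one can score each preference pair and route the lowest-margin pairs to augmentation:

import torch

def select_low_margin_pairs(reward_fn, chosen, rejected, keep_frac=0.25):
    """Return indices of the preference pairs the reward model is least certain about;
    these low-margin pairs are the ones a MARS-style loop would augment further."""
    with torch.no_grad():
        margin = reward_fn(chosen) - reward_fn(rejected)     # [batch]
    k = max(1, int(keep_frac * margin.numel()))
    return margin.topk(k, largest=False).indices             # smallest margins first

# Toy usage with a stand-in linear "reward model" over feature vectors.
rm = torch.nn.Linear(16, 1)
chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)
hard_idx = select_low_margin_pairs(lambda x: rm(x).squeeze(-1), chosen, rejected)
print(hard_idx.shape)   # torch.Size([16])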

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
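
The sketch below illustrates only the variance-reduction intuition behind Multi-Crop Alignment: average the alignment gradient over several random crops per iteration. The encoder, crop size, and loss are placeholders, not the M-Attack-V2 implementation.

import torch
import torch.nn.functional as F

def mca_gradient(adv_image, target_embed, embed_fn, num_crops=8, crop=224):
    """Average the alignment gradient over several random crops of the source image,
    a variance-reduction sketch in the spirit of Multi-Crop Alignment. embed_fn maps
    an image batch to embeddings (a stand-in for the surrogate vision encoder)."""
    adv = adv_image.clone().requires_grad_(True)
    _, _, h, w = adv.shape
    total = 0.0
    for _ in range(num_crops):
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        view = adv[:, :, top:top + crop, left:left + crop]
        # Pull each crop's embedding toward the target image's embedding.
        total = total + (1 - F.cosine_similarity(embed_fn(view), target_embed).mean())
    (total / num_crops).backward()
    return adv.grad  # averaged, lower-variance update direction

# Toy usage with a random stand-in "encoder".
enc = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 32))
g = mca_gradient(torch.rand(1, 3, 256, 256), torch.randn(1, 32), enc)
print(g.shape)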

FAMOSE: A ReAct Approach to Automated Feature Discovery

Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that leverages the ReAct paradigm to autonomously explore, generate, and refine features while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE represents the first application of an agentic ReAct framework to automated feature engineering that covers both regression and classification tasks. Extensive experiments demonstrate that FAMOSE is at or near the state of the art on classification tasks (especially tasks with more than 10K instances, where ROC-AUC increases by 0.23% on average), and achieves state-of-the-art results on regression tasks by reducing RMSE by 2.0% on average, while remaining more robust to errors than other algorithms. We hypothesize that FAMOSE's strong performance stems from ReAct letting the LLM context window record, via iterative feature discovery and evaluation steps, which features did or did not work. This acts much like a few-shot prompt and guides the LLM toward better, more inventive features. Our work offers evidence that AI agents are remarkably effective at problems that require highly inventive solutions, such as feature engineering.

Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting

Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.
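
To make the architectural recipe more concrete, here is a heavily simplified block that interleaves a long depthwise convolution with an elementwise gated linear recurrence. It is illustrative only; it is neither the Reverso model nor a faithful DeltaNet layer.

import torch
import torch.nn as nn

class TinyHybridBlock(nn.Module):
    """Illustrative only: a causal long depthwise convolution followed by a simple
    gated linear recurrence, standing in for the conv + linear-RNN interleaving
    described above (not the Reverso architecture or a real DeltaNet layer)."""
    def __init__(self, dim, kernel_size=64):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: [batch, seq, dim]
        b, t, d = x.shape
        # Causal long convolution (trim the right-side padding).
        h = self.conv(x.transpose(1, 2))[..., :t].transpose(1, 2)
        # Minimal elementwise recurrence with an input-dependent decay gate.
        decay = torch.sigmoid(self.gate(h))
        state = torch.zeros(b, d, device=x.device)
        outs = []
        for step in range(t):
            state = decay[:, step] * state + (1 - decay[:, step]) * h[:, step]
            outs.append(state)
        return x + self.proj(torch.stack(outs, dim=1))   # residual connection

block = TinyHybridBlock(32)
print(block(torch.randn(2, 128, 32)).shape)   # torch.Size([2, 128, 32])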

When to Trust the Cheap Check: Weak and Strong Verification for Reasoning

Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak–strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. At the population level, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
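
The two-threshold structure is easy to picture in code. The sketch below (thresholds and interfaces are assumptions) accepts when the weak verifier is confident the answer is right, rejects when it is confident the answer is wrong, and pays for strong verification only in the ambiguous middle band:

def verify(weak_score, strong_check, accept_thresh=0.9, reject_thresh=0.2):
    """Two-threshold weak/strong verification policy, sketching the structure
    described above. weak_score is a cheap verifier's confidence in [0, 1];
    strong_check() is the expensive, trusted verifier (thresholds are assumptions)."""
    if weak_score >= accept_thresh:
        return True, "accepted by weak verifier"
    if weak_score <= reject_thresh:
        return False, "rejected by weak verifier"
    # Ambiguous region: pay for strong verification.
    return strong_check(), "deferred to strong verifier"

# Toy usage: a strong verifier stub that always confirms the answer.
print(verify(0.95, lambda: True))   # (True, 'accepted by weak verifier')
print(verify(0.55, lambda: True))   # (True, 'deferred to strong verifier')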

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
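
One plausible reading of that first-order condition, written as a regularizer (the interfaces and temperature are assumptions, not the paper's implementation), penalizes the gap between the critic's action-gradient and the policy's action-score during offline training:

import torch

def score_match_penalty(critic, policy_log_prob, states, actions, alpha=1.0):
    """One possible reading of a SMAC-style regularizer: encourage the critic's
    action-gradient to match a temperature-scaled score of the policy,
    grad_a Q(s, a) ~= alpha * grad_a log pi(a | s). Interfaces are assumptions."""
    actions = actions.detach().clone().requires_grad_(True)
    q = critic(states, actions).sum()
    logp = policy_log_prob(states, actions).sum()
    grad_q = torch.autograd.grad(q, actions, create_graph=True)[0]
    score = torch.autograd.grad(logp, actions, create_graph=True)[0]
    return ((grad_q - alpha * score) ** 2).mean()

# Toy usage with quadratic stand-ins for the critic and a Gaussian policy.
critic = lambda s, a: -(a ** 2).sum(-1) + (s * 0).sum(-1)
log_prob = lambda s, a: -0.5 * ((a - s) ** 2).sum(-1)
s, a = torch.randn(8, 4), torch.randn(8, 4)
print(score_match_penalty(critic, log_prob, s, a))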

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5× while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
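
A minimal sketch of the ESS-based damping (the closed-form baseline is omitted and the interfaces are assumptions): estimate the effective sample size of the importance ratios and shrink the step size when it collapses.

import torch

def ess_scaled_lr(base_lr, logp_current, logp_behavior):
    """Scale the learning rate by the normalized effective sample size of the
    importance ratios, sketching the ESS-based damping described above."""
    w = torch.exp(logp_current - logp_behavior)          # importance ratios
    ess = w.sum() ** 2 / (w ** 2).sum()                  # classic ESS estimator
    ess_frac = (ess / w.numel()).clamp(0.0, 1.0)         # 1.0 means effectively on-policy
    return base_lr * ess_frac.item(), ess_frac.item()

# Toy usage: stale rollouts give heavy-tailed ratios and a strongly damped step size.
logp_new = torch.randn(512) * 0.1
logp_old = logp_new - torch.randn(512).abs() * 2.0       # simulate staleness
print(ess_scaled_lr(3e-6, logp_new, logp_old))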

Towards Anytime-Valid Statistical Watermarking

The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid anytime inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
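
The anytime-valid mechanics can be sketched generically: accumulate a product of e-values, which forms a test supermartingale under the no-watermark null, and stop as soon as the product exceeds 1/alpha; Ville's inequality then bounds the Type-I error by alpha at any stopping time. The per-token e-values below are placeholders, not the paper's anchored construction.

import random

def detect_watermark(evalue_stream, alpha=0.01):
    """Anytime-valid detection sketch: multiply per-token e-values (each must satisfy
    E[e] <= 1 under the null) and stop as soon as the running product exceeds 1/alpha.
    Ville's inequality bounds the false-positive rate by alpha at any stopping time."""
    wealth = 1.0
    t = 0
    for t, e in enumerate(evalue_stream, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return True, t     # watermark detected after t tokens
    return False, t

# Toy usage: e-values slightly above 1 mimic watermarked text.
random.seed(0)
stream = (1.0 + random.uniform(-0.05, 0.25) for _ in range(2000))
print(detect_watermark(stream))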

AI Models

lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4


base_model:

  • cerebras/MiniMax-M2.5-REAP-139B-A10B
  • MiniMaxAI/MiniMax-M2.5

license: mit

Model Description

MiniMax-M2.5-NVFP4-REAP is an NVFP4-quantized version of cerebras/MiniMax-M2.5-REAP-139B-A10B, a 139B-parameter Mixture-of-Experts language model with 10B active parameters and 154 experts (pruned from 256 via REAP).

The REAP checkpoint's FP8 weights were first dequantized to BF16, then quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.

The 40% expert pruning from REAP combined with NVFP4 quantization makes this model small enough to run on a single 96GB GPU.
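
For intuition about the weight layout, here is an illustrative PyTorch sketch, not ModelOpt or the real NVFP4 format (which stores FP4 values with FP8 block scales per 16 elements): blockwise 4-bit quantization with one scale per 16-element block looks roughly like this.

import torch

def blockwise_fp4_sketch(weight, block=16):
    """Illustrative only (not ModelOpt/NVFP4): quantize a weight tensor to 4-bit
    values with one scale per 16-element block, then dequantize. Real NVFP4 stores
    FP4 values with FP8 block scales; a symmetric integer grid is used here just to
    show the blockwise-scale layout."""
    flat = weight.reshape(-1, block).float()
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0   # 4-bit range [-7, 7]
    q = torch.clamp(torch.round(flat / scale), -7, 7)
    deq = (q * scale).reshape(weight.shape)
    return deq, scale

w = torch.randn(4096, 4096)
w_deq, scales = blockwise_fp4_sketch(w)
print((w - w_deq).abs().mean())   # mean absolute quantization error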

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. Attention layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a vastly larger number of samples than typical to ensure broad expert coverage through natural routing alone.

Calibration dataset

Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.

How to Run

Fits on a single RTX Pro 6000 Blackwell!

SGLang

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1

python3 -m sglang.launch_server \
  --model lukealonso/MiniMax-M2.5-NVFP4-REAP \
  --served-model-name MiniMax-M2.5 \
  --reasoning-parser minimax \
  --tool-call-parser minimax-m2 \
  --trust-remote-code \
  --tp 1 \
  --mem-fraction-static 0.95 \
  --max-running-requests 32 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype bf16 \
  --enable-flashinfer-allreduce-fusion \
  --host 0.0.0.0 \
  --port 8000

vLLM

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-NVFP4-REAP \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name MiniMax-M2.5 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --attention-backend FLASH_ATTN \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think


Author: lukealonso

Likes: 5

Downloads: 0

Tags: safetensors, minimax_m2, custom_code, arxiv:2510.13999, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:mit, 8-bit, modelopt, region:us

mmnga-o/GPT-OSS-Swallow-20B-RL-v0.1-gguf


license: apache-2.0
language:
  • ja
datasets:
  • TFMC/imatrix-dataset-for-japanese-llm
base_model:
  • tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1

GPT-OSS-Swallow-20B-RL-v0.1-gguf

This is a GGUF-format conversion of GPT-OSS-Swallow-20B-RL-v0.1, published by tokyotech-llm.

The imatrix data was created using TFMC/imatrix-dataset-for-japanese-llm.

Usage

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
build/bin/llama-cli -m 'GPT-OSS-Swallow-20B-RL-v0.1-gguf' -n 128 -c 128 -p 'あなたはプロの料理人です。レシピを教えて' -cnv

Author: mmnga-o

Likes: 4

Downloads: 0

Tags: gguf, ja, dataset:TFMC/imatrix-dataset-for-japanese-llm, base_model:tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1, base_model:quantized:tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

ottomate/pocket-tts-ONNX


license: cc-by-4.0
library_name: onnxruntime
base_model: kyutai/pocket-tts
tags:

  • text-to-speech
  • onnx
  • onnxruntime
  • pocket-tts

ottomate/pocket-tts-ONNX

ONNX export artifacts for Pocket TTS.

Contents

  • tokenizer.json (for @huggingface/tokenizers compatibility)
  • onnx/flow_lm_main.onnx
  • onnx/flow_lm_flow.onnx
  • onnx/mimi_encoder.onnx
  • onnx/mimi_decoder.onnx
  • onnx/text_conditioner.onnx
  • optional *_int8.onnx quantized variants

Attribution

This repository is derived from Kyutai's Pocket TTS model: https://huggingface.co/kyutai/pocket-tts

License

This model card uses the same license field as the upstream model card: cc-by-4.0

Also follow upstream usage conditions and prohibited-use policy documented at: https://huggingface.co/kyutai/pocket-tts

Author: ottomate

Likes: 2

Downloads: 0

Tags: onnxruntime, onnx, text-to-speech, pocket-tts, base_model:kyutai/pocket-tts, base_model:quantized:kyutai/pocket-tts, license:cc-by-4.0, region:us

HiTZ/PL-BERT-wp-eu


license: apache-2.0
language:
  • eu
tags:
  • TTS
  • PL-BERT
  • WordPiece
  • hitz-aholab

PL-BERT-eu


Model Description

PL-BERT-eu is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the PL-BERT architecture and learns phoneme representations via a masked language modeling objective.

This model supports phoneme-based text-to-speech (TTS) systems such as StyleTTS2, providing a Basque-specific phoneme vocabulary and contextual embeddings.

Features of our PL-BERT:

  • It is trained exclusively on Basque phonemized Wikipedia text.
  • It uses a reduced phoneme vocabulary of 178 tokens.
  • It utilizes a WordPiece tokenizer for phonemized Basque text.
  • It includes a custom token_maps_eu.pkl and adapted util.py.

Intended Uses and Limitations

Intended uses

  • Integration into phoneme-based TTS pipelines such as StyleTTS2.
  • Speech synthesis and phoneme embedding extraction for Basque.

Limitations

  • Not designed for general NLP tasks.
  • Only supports Basque phoneme tokens.

How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

  1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2

  2. Inside the Utils directory, create a new folder, for example: PLBERT_eu.

  3. Copy the following files into that folder:

    • config.yml (training configuration)
    • step_4000000.t7 (trained checkpoint)
    • util.py (modified to fix position ID loading)
  4. In your StyleTTS2 configuration file, update the PLBERT_dir entry to:

    PLBERT_dir: Utils/PLBERT_eu

  5. Update the import statement in your code to:

    from Utils.PLBERT_eu.util import load_plbert

  6. We used code developed by Aholab to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at arrandi/phonemizer-eus-esp. Likewise, the code used to generate IPA phonemes can be found in the phonemizer directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.

Note: If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see issue #254 in the original StyleTTS2 repository.


Training Details

Training data

The model was trained on a Basque corpus phonemized using Modelo1y2. It uses a consistent phoneme token set with boundary markers and masking tokens.

  • Tokenizer: custom (splits on whitespace)
  • Phoneme masking strategy: phoneme-level masking and replacement
  • Training steps: 4,000,000
  • Precision: mixed precision (fp16)
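
As a rough sketch of the masking objective described above (token ids and the exact corruption schedule are assumptions; the probabilities mirror the configuration listed below):

import torch

def mask_phonemes(token_ids, vocab_size=178, mask_id=1, mask_p=0.1, replace_p=0.2):
    """Sketch of phoneme-level masking and replacement (the mask token id is a
    placeholder). Returns corrupted inputs and MLM labels (-100 = not predicted)."""
    ids = token_ids.clone()
    labels = torch.full_like(ids, -100)
    corrupt = torch.rand(ids.shape) < (mask_p + replace_p)
    labels[corrupt] = ids[corrupt]
    # Among corrupted positions, some are masked, the rest replaced by random phonemes.
    use_mask = torch.rand(ids.shape) < (mask_p / (mask_p + replace_p))
    ids[corrupt & use_mask] = mask_id
    rand = torch.randint(0, vocab_size, ids.shape)
    ids[corrupt & ~use_mask] = rand[corrupt & ~use_mask]
    return ids, labels

inp, lab = mask_phonemes(torch.randint(0, 178, (2, 64)))
print(inp.shape, (lab != -100).float().mean())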

Training configuration

Model parameters:

  • Vocabulary size: 178
  • Hidden size: 768
  • Attention heads: 12
  • Intermediate size: 2048
  • Number of layers: 12
  • Max position embeddings: 512
  • Dropout: 0.1
  • Embedding size: 128
  • Number of hidden groups: 1
  • Number of hidden layers per group: 12
  • Inner group number: 1
  • Downscale factor: 1

Other parameters:

  • Batch size: 32
  • Max mel length: 512
  • Word mask probability: 0.15
  • Phoneme mask probability: 0.1
  • Replacement probability: 0.2
  • Token separator: space
  • Token mask: M
  • Word separator ID: 2
  • Scheduler type: OneCycleLR
  • Learning rate: 0.0002
  • pct_start: 0.1
  • Annealing strategy: cosine annealing
  • div_factor: 25
  • final_div_factor: 10000

Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.


Citation

If this code contributes to your research, please cite the work:

@misc{aarriandiagaplberteu,
   title={PL-BERT-eu}, 
   author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
   organization={Hitz (Aholab) - EHU},
   url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
   year={2026}
}

Additional Information

Author

Author: Ander Arriandiaga — Aholab (Hitz), EHU

Contact

For further information, please send an email to inma.hernaez@ehu.eus.

Copyright

Copyright(c) 2026 by Aholab, HiTZ.

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.

Author: HiTZ

Likes: 2

Downloads: 0

Tags: TTS, PL-BERT, WordPiece, hitz-aholab, eu, license:apache-2.0, region:us

data-archetype/disco_diffae_p16_c64_m


license: apache-2.0 tags:

  • diffusion
  • autoencoder
  • image-reconstruction
  • pytorch library_name: disco-diffae

disco_diffae_p16_c64_m

Disco DiffAE (Diffusion Tokenizer + SiD2 + Convolution): a compact diffusion autoencoder that encodes images to spatial latents and decodes them via iterative VP (variance-preserving) diffusion.

Model Variants

| Variant | Patch | Channels | Compression | Notes |
|---------|-------|----------|-------------|-------|
| disco_diffae_p16_c64_m | 16x16 | 64 | 12x | recommended |
| disco_diffae_p16_c32_m | 16x16 | 32 | 24x | |

This variant (disco_diffae_p16_c64_m): 120.8M parameters, 461.0 MB.


Quick Start

import torch
from disco_diffae import DiscoDiffAE

# Load from HuggingFace Hub (or a local path)
model = DiscoDiffAE.from_pretrained("disco_diffae_p16_c64_m", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (H and W are the original image height and width)
H, W = images.shape[-2], images.shape[-1]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + decode)
recon = model.reconstruct(images)

Note: Requires pip install huggingface_hub safetensors for Hub downloads. You can also pass a local directory path to from_pretrained().

Architecture

| Property | Value |
|---|---|
| Parameters | 120,842,688 |
| File size | 461.0 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 8 |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |

Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.

Decoder: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. Start blocks (2) -> middle blocks (4) -> skip fusion -> end blocks (2). Supports Path-Drop Guidance (PDG) at inference for quality/speed tradeoff.

Recommended Settings

Best quality is achieved with just 1 DDIM step and PDG disabled, making inference extremely fast. PDG (strength 2-4) can optionally increase perceptual sharpness but is easy to overdo.

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |

from disco_diffae import DiscoDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = DiscoDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)

Citation

@misc{disco_diffae_16_64,
  title   = {Disco DiffAE: Compact Diffusion Autoencoders with DiCo Blocks},
  author  = {data-archetype},
  year    = {2026},
  month   = feb,
  url     = {https://huggingface.co/disco_diffae_p16_c64_m},
}

Dependencies

  • PyTorch >= 2.0
  • safetensors (for loading weights)

License

Apache 2.0

Author: data-archetype

Likes: 2

Downloads: 0

Tags: disco-diffae, safetensors, diffusion, autoencoder, image-reconstruction, pytorch, license:apache-2.0, region:us

mmnga-o/Qwen3-Swallow-30B-A3B-SFT-v0.2-gguf


license: apache-2.0
language:
  • ja
datasets:
  • TFMC/imatrix-dataset-for-japanese-llm
base_model:
  • tokyotech-llm/Qwen3-Swallow-30B-A3B-SFT-v0.2

Qwen3-Swallow-30B-A3B-SFT-v0.2-gguf

This is a GGUF-format conversion of Qwen3-Swallow-30B-A3B-SFT-v0.2, published by tokyotech-llm.

The imatrix data was created using TFMC/imatrix-dataset-for-japanese-llm.

Usage

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
build/bin/llama-cli -m 'Qwen3-Swallow-30B-A3B-SFT-v0.2-gguf' -n 128 -c 128 -p 'あなたはプロの料理人です。レシピを教えて' -cnv

Author: mmnga-o

Likes: 2

Downloads: 0

Tags: gguf, ja, dataset:TFMC/imatrix-dataset-for-japanese-llm, base_model:tokyotech-llm/Qwen3-Swallow-30B-A3B-SFT-v0.2, base_model:quantized:tokyotech-llm/Qwen3-Swallow-30B-A3B-SFT-v0.2, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

prithivMLmods/Qwen-Image-Edit-AIO-fp8


license: apache-2.0
language:
  • en
base_model:
  • Qwen/Qwen-Image-Edit-2511
  • Qwen/Qwen-Image-Edit-2509
library_name: diffusers
pipeline_tag: image-to-image
tags:
  • qwen-image-edit
  • fp8
  • vllm
  • compressed-tensors
  • art


Qwen-Image-Edit-AIO-fp8

Qwen-Image-Edit-AIO-fp8 is an FP8-compressed Transformers-only release built from Qwen-Image-Edit-2511 and Qwen-Image-Edit-2509. This repository provides FP8 (W8A8 · F8_E4M3) compressed transformer weights compatible with both Transformers and Diffusers pipelines, optimized for lower VRAM usage and higher throughput while maintaining strong editing fidelity.
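
For intuition, a per-tensor FP8 (E4M3) round trip in plain PyTorch looks roughly like the sketch below. This is not the compressed-tensors scheme used for this checkpoint (which applies its own scaling granularity); it only illustrates the dtype and the role of the scale. torch.float8_e4m3fn requires PyTorch 2.1 or newer.

import torch

def to_fp8_e4m3(weight):
    """Illustrative per-tensor FP8 (E4M3) compression sketch: scale the tensor into
    the E4M3 range (max normal value is 448), cast to torch.float8_e4m3fn, and keep
    the scale for dequantization at load or compute time."""
    scale = weight.abs().amax().clamp(min=1e-12) / 448.0
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)     # requires PyTorch >= 2.1
    return w_fp8, scale

def from_fp8_e4m3(w_fp8, scale, dtype=torch.bfloat16):
    return w_fp8.to(dtype) * scale

w = torch.randn(1024, 1024)
w8, s = to_fp8_e4m3(w)
print((w - from_fp8_e4m3(w8, s)).abs().mean())   # reconstruction error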

Diffusers Usage

Qwen-Image-Edit-2509-fp8

import torch
from diffusers.models import QwenImageTransformer2DModel
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image

# Load transformer (2509 version)
transformer = QwenImageTransformer2DModel.from_pretrained(
    "prithivMLmods/FireRed-Image-Edit-1.0-fp8",
    subfolder="Qwen-Image-Edit-2509-fp8/transformer",
    torch_dtype=torch.bfloat16
)

# Load pipeline
pipeline = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509",
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

pipeline.to("cuda")

# Load image
image1 = load_image("grumpycat.png")
prompt = "turn the cat into an orange cat"

# Inference inputs
inputs = {
    "image": [image1],
    "prompt": prompt,
    "generator": torch.manual_seed(42),
    "true_cfg_scale": 1.0,
    "negative_prompt": " ",
    "num_inference_steps": 40,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

# Run pipeline
output = pipeline(**inputs)
output_image = output.images[0]
output_image.save("output_image_2509.png")

Qwen-Image-Edit-2511-fp8

import torch
from diffusers.models import QwenImageTransformer2DModel
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image

# Load transformer (2511 version)
transformer = QwenImageTransformer2DModel.from_pretrained(
    "prithivMLmods/FireRed-Image-Edit-1.0-fp8",
    subfolder="Qwen-Image-Edit-2511-fp8/transformer",
    torch_dtype=torch.bfloat16
)

# Load pipeline
pipeline = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2511",
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

pipeline.to("cuda")

# Load image
image1 = load_image("grumpycat.png")
prompt = "turn the cat into an orange cat"

# Inference inputs
inputs = {
    "image": [image1],
    "prompt": prompt,
    "generator": torch.manual_seed(42),
    "true_cfg_scale": 1.0,
    "negative_prompt": " ",
    "num_inference_steps": 40,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

# Run pipeline
output = pipeline(**inputs)
output_image = output.images[0]
output_image.save("output_image_2511.png")

About the Base Models

Qwen-Image-Edit-2511 is an advanced iteration over Qwen-Image-Edit-2509, a production-grade 20B-parameter image editing model developed by Alibaba's Qwen team.

It is built on the Qwen-Image MMDiT architecture with VL-Qwen2.5 integration and VAE encoding for high-fidelity real-world editing workflows.

Architecture Overview

<table align="center"> <tr> <td align="center"> <b>DiT / MMDiT Block</b><br><br> <img src="https://www.researchgate.net/publication/390354569/figure/fig1/AS%3A11431281339077051%401743478379143/DiT-and-MMDiT-block-architecture-In-MMDiT-after-projections-visual-and-text-tokens-are.ppm" width="350"/> </td> <td align="center"> <b>U-Net Architecture</b><br><br> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/unet-model.png" width="350"/> </td> </tr> <tr> <td align="center"> <b>Cross-Attention Mechanism</b><br><br> <img src="https://miro.medium.com/v2/resize%3Afit%3A1400/1%2AkXiln_TbF15oVg7AjcUEkQ.png" width="350"/> </td> <td align="center"> <b>Variational Autoencoder (VAE)</b><br><br> <img src="https://www.researchgate.net/publication/374693185/figure/fig1/AS%3A11431281198179610%401697206307688/Simplified-schematic-of-the-VAE-variational-autoencoder-VAE-models-Fig-1-are.png" width="350"/> </td> </tr> </table>
  • Qwen-Image MMDiT backbone
  • VL-Qwen2.5 multimodal integration
  • Latent VAE encoding
  • Diffusion-based editing pipeline
  • Compatible with Diffusers and ComfyUI workflows

What Makes 2511 Stronger Than 2509

Multi-Person & Identity Consistency

  • Reduced identity swaps in group photos
  • Improved pose consistency
  • Stable multi-subject generation

High-Fidelity Iterative Editing

  • Strong identity preservation across multiple edits
  • Reduced image drift
  • Reliable prompt alignment

Dual-Mode Editing Control

  • Appearance editing for localized modifications
  • Semantic editing for global transformations

Precise Text Handling

  • Natural typography rendering
  • Accurate on-image text replacement
  • Reduced distortions in signage or UI text

Enhanced Structural Reasoning

  • Better geometric alignment
  • Accurate object replacement
  • Material-aware transformations

Industrial & Commercial Use

  • Product design material swaps
  • Clean geometry preservation
  • Suitable for e-commerce and marketing workflows
  • Reliable outputs for creative production pipelines

What This FP8 AIO Version Provides

Qwen-Image-Edit-AIO-fp8 introduces:

  • FP8 (W8A8 · F8_E4M3) transformer compression
  • Reduced VRAM requirements
  • Higher inference throughput
  • Transformers-native loading
  • Diffusers-compatible transformer weights
  • Optimized for Hopper-class and compatible GPUs

This release focuses strictly on compressed transformer weights while preserving original editing capabilities.

Intended Workflows

  • ComfyUI pipelines
  • Diffusers-based applications
  • Production-grade editing systems
  • Batch commercial pipelines
  • E-commerce product editing
  • Marketing creative generation

Limitations

  • FP8 acceleration requires compatible GPU architectures.
  • Extremely fine-grained edge cases may show minor precision differences compared to full BF16.
  • Users are responsible for lawful and ethical deployment.

Author: prithivMLmods

Likes: 2

Downloads: 0

Tags: diffusers, safetensors, qwen-image-edit, fp8, vllm, compressed-tensors, art, image-to-image, en, base_model:Qwen/Qwen-Image-Edit-2509, base_model:finetune:Qwen/Qwen-Image-Edit-2509, license:apache-2.0, region:us

artificialguybr/3DRenderStyle-REDMOND-ZIMAGE


tags:

  • text-to-image
  • lora
  • diffusers
  • template:diffusion-lora
widget:
  • output: url: images/014.png text: '-'
  • output: url: images/013.png text: '-'
  • output: url: images/011.png text: '-'
  • output: url: images/010.png text: '-'
  • output: url: images/009.png text: '-'
  • output: url: images/008.png text: '-'
  • output: url: images/007.png text: '-'
  • output: url: images/006.png text: '-'
  • output: url: images/005.png text: '-'
  • output: url: images/004.png text: '-'
  • output: url: images/003.png text: '-'
  • output: url: images/002.png text: '-'
  • output: url: images/001.png text: '-'
base_model: Tongyi-MAI/Z-Image-Turbo
instance_prompt: 3D Render Style, 3DRenderAF
license: apache-2.0

3D RENDER STYLE REDMOND LORA FOR Z IMAGE TURBO


Model description

#3D Render Style

I'm grateful for the GPU time from Redmond.AI that allowed me to make this model!

This LoRA was trained on 3D Render Style style images. It generates high-quality 3d render style content with excellent consistency.


I really hope you like the model and use it!

Trigger words

You should use `3D Render Style. 3DRenderAF.` to trigger the image generation.

Download model

Weights for this model are available in Safetensors format; download them from the Files & versions tab.

How to use

I recommend using this LoRA with ComfyUI for the best results.

License

This model is licensed under the Apache License 2.0


Support My Work

If you like the model and think it's worth it, you can make a donation to support my work.

Follow me on Twitter to get early access to all my new models: @artificialguybr

Visit my website: artificialguy.com


Author: artificialguybr

Likes: 2

Downloads: 0

Tags: diffusers, text-to-image, lora, template:diffusion-lora, base_model:Tongyi-MAI/Z-Image-Turbo, base_model:adapter:Tongyi-MAI/Z-Image-Turbo, license:apache-2.0, region:us

artificialguybr/ICONS-REDMOND-ZIMAGETURBO


tags:

  • text-to-image
  • lora
  • diffusers
  • template:diffusion-lora
widget:
  • output: url: images/015.png text: '-'
  • output: url: images/014.png text: '-'
  • output: url: images/010.png text: '-'
  • output: url: images/008.png text: '-'
  • output: url: images/004.png text: '-'
  • output: url: images/003.png text: '-'
  • output: url: images/002.png text: '-'
  • output: url: images/001.png text: '-'
base_model: Tongyi-MAI/Z-Image-Turbo
instance_prompt: ICREDM, ICONS
license: apache-2.0

ICONS REDMOND FOR ZIMAGETURBO


Model description

#App Icons is here!

I'm grateful for the GPU time from Redmond.AI that allowed me to make this model!

This LoRA was trained on App Icons style images. It generates high-quality app-icon content with excellent consistency.


I really hope you like the model and use it!

Trigger words

You should use `ICONS. ICREDM.` to trigger the image generation.

Download model

Weights for this model are available in Safetensors format; download them from the Files & versions tab.

How to use

I recommend using this LoRA with ComfyUI for the best results.

License

This model is licensed under the Apache License 2.0


Support My Work

If you like the model and think it's worth it, you can make a donation to support my work.

Follow me on Twitter to get early access to all my new models: @artificialguybr

Visit my website: artificialguy.com


Author: artificialguybr

Likes: 2

Downloads: 0

Tags: diffusers, text-to-image, lora, template:diffusion-lora, base_model:Tongyi-MAI/Z-Image-Turbo, base_model:adapter:Tongyi-MAI/Z-Image-Turbo, license:apache-2.0, region:us

heretic-org/gemma-3-4b-it-heretic


license: gemma
library_name: transformers
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-4b-it
tags:

  • heretic
  • uncensored
  • decensored
  • abliterated
  • gemma-3

This is a decensored version of google/gemma-3-4b-it, made using Heretic v1.2.0.

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| direction_index | per layer |
| attn.o_proj.max_weight | 1.37 |
| attn.o_proj.max_weight_position | 20.68 |
| attn.o_proj.min_weight | 1.34 |
| attn.o_proj.min_weight_distance | 13.04 |
| mlp.down_proj.max_weight | 1.30 |
| mlp.down_proj.max_weight_position | 23.84 |
| mlp.down_proj.min_weight | 0.88 |
| mlp.down_proj.min_weight_distance | 13.81 |
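
For context, abliteration-style edits generally work by projecting a learned refusal direction out of selected weight matrices, with a per-layer strength in the spirit of the weights tabulated above. The sketch below shows only that core projection step with placeholder tensors; it is not Heretic's actual procedure.

import torch

def ablate_direction(weight, direction, strength=1.0):
    """Conceptual sketch of the core abliteration step (not Heretic's procedure):
    remove the component of a weight matrix's output that lies along a learned
    refusal direction. 'direction' is a vector in the layer's output space;
    'strength' plays the role of a per-layer weight like those tabulated above."""
    d = direction / direction.norm()
    # Project each output off the refusal direction: W <- W - strength * d d^T W
    return weight - strength * torch.outer(d, d) @ weight

w = torch.randn(2560, 2560)          # placeholder for e.g. an o_proj / down_proj matrix
refusal_dir = torch.randn(2560)      # placeholder refusal direction
w_ablated = ablate_direction(w, refusal_dir, strength=1.0)
print((refusal_dir / refusal_dir.norm() @ w_ablated).norm())  # ~0: component removed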

Performance

| Metric | This model | Original model (google/gemma-3-4b-it) |
| :----- | :--------: | :---------------------------: |
| KL divergence | 0.1056 | 0 (by definition) |
| Refusals | 28/100 | 99/100 |


Gemma 3 model card

Model Page: Gemma

Terms of Use: Terms

Authors: Google DeepMind

Model Information

Summary description and brief definition of inputs and outputs.

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

Inputs and outputs

  • Input:

    • Text string, such as a question, a prompt, or a document to be summarized
    • Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
    • Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size
  • Output:

    • Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    • Total output context of 8192 tokens

Usage

Below are some code snippets to help you get started quickly with the model. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.

$ pip install -U transformers

Then, copy the snippet from the section that is relevant for your use case.

Running with the pipeline API

You can initialize the model and processor for inference with pipeline as follows.

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)

With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass them to the pipeline.

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
# Okay, let's take a look! 
# Based on the image, the animal on the candy is a **turtle**. 
# You can see the shell shape and the head and legs.

Running the model on a single/multi GPU

# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-4b-it"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene, 
# focusing on a cluster of pink cosmos flowers and a busy bumblebee. 
# It has a slightly soft, natural feel, likely captured in daylight.

Citation

@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}

Model Data

Data used for model training and how the data was processed.

Training Dataset

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, and the 1B model with 2 trillion tokens. Here are the key components:

  • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
  • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
  • Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
  • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

  • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
  • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
  • Additional methods: Filtering based on content quality and safety in line with our policies.

Implementation Information

Details about the model internals.

Hardware

Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMS) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

  • Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
  • Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
  • Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
  • Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
  • These advantages are aligned with Google's commitments to operate sustainably.

Software

Training was done using JAX and ML Pathways.

JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

Evaluation

Model evaluation metrics and results.

Benchmark Results

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

Reasoning and factuality

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| HellaSwag | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| BoolQ | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| PIQA | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| SocialIQA | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| TriviaQA | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| Natural Questions | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| ARC-c | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| ARC-e | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| WinoGrande | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| BIG-Bench Hard | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| DROP | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

STEM and code

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| MMLU | 5-shot | 59.6 | 74.5 | 78.6 |
| MMLU (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| AGIEval | 3-5-shot | 42.1 | 57.4 | 66.2 |
| MATH | 4-shot | 24.2 | 43.3 | 50.0 |
| GSM8K | 8-shot | 38.4 | 71.0 | 82.6 |
| GPQA | 5-shot | 15.0 | 25.4 | 24.3 |
| MBPP | 3-shot | 46.0 | 60.4 | 65.6 |
| HumanEval | 0-shot | 36.0 | 45.7 | 48.8 |

Multilingual

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| MGSM | 2.04 | 34.7 | 64.3 | 74.3 |
| Global-MMLU-Lite | 24.9 | 57.0 | 69.4 | 75.7 |
| WMT24++ (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| FloRes | 29.5 | 39.2 | 46.0 | 48.8 |
| XQuAD (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| ECLeKTic | 4.69 | 11.0 | 17.2 | 24.4 |
| IndicGenBench | 41.4 | 57.2 | 61.7 | 63.4 |

Multimodal

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| COCOcap | 102 | 111 | 116 |
| DocVQA (val) | 72.8 | 82.3 | 85.6 |
| InfoVQA (val) | 44.1 | 54.8 | 59.4 |
| MMMU (pt) | 39.2 | 50.3 | 56.1 |
| TextVQA (val) | 58.9 | 66.5 | 68.6 |
| RealWorldQA | 45.5 | 52.2 | 53.9 |
| ReMI | 27.3 | 38.5 | 44.8 |
| AI2D | 63.2 | 75.2 | 79.0 |
| ChartQA | 63.6 | 74.7 | 76.3 |
| VQAv2 | 63.9 | 71.2 | 72.9 |
| BLINK | 38.0 | 35.9 | 39.6 |
| OKVQA | 51.0 | 58.7 | 60.2 |
| TallyQA | 42.5 | 51.8 | 54.3 |
| SpatialSense VQA | 50.9 | 60.0 | 59.4 |
| CountBenchQA | 26.1 | 17.8 | 68.0 |

Ethics and Safety

Ethics and safety evaluation approach and results.

Evaluation Approach

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

  • Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
  • Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech.
  • Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

Evaluation Results

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was they included only English language prompts.

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Content Creation and Communication
    • Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
    • Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
  • Research and Education
    • Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data
    • The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity
    • Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance
    • Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy
    • Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  • Common Sense
    • Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

Ethical Considerations and Risks

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

  • Bias and Fairness
    • VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing described and posterior evaluations reported in this card.
  • Misinformation and Misuse
    • VLMs can be misused to generate text that is false, misleading, or harmful.
    • Guidelines are provided for responsible use with the model, see the Responsible Generative AI Toolkit.
  • Transparency and Accountability:
    • This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
    • A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

  • Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
  • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
  • Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
  • Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

Benefits

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably-sized open model alternatives.

Author: heretic-org

Likes: 2

Downloads: 19

Tags: transformers, safetensors, gemma3, image-text-to-text, heretic, uncensored, decensored, abliterated, gemma-3, conversational, arxiv:1905.07830, arxiv:1905.10044, arxiv:1911.11641, arxiv:1904.09728, arxiv:1705.03551, arxiv:1911.01547, arxiv:1907.10641, arxiv:1903.00161, arxiv:2009.03300, arxiv:2304.06364, arxiv:2103.03874, arxiv:2110.14168, arxiv:2311.12022, arxiv:2108.07732, arxiv:2107.03374, arxiv:2210.03057, arxiv:2106.03193, arxiv:1910.11856, arxiv:2502.12404, arxiv:2502.21228, arxiv:2404.16816, arxiv:2104.12756, arxiv:2311.16502, arxiv:2203.10244, arxiv:2404.12390, arxiv:1810.12440, arxiv:1908.02660, arxiv:2312.11805, base_model:google/gemma-3-4b-it, base_model:finetune:google/gemma-3-4b-it, license:gemma, text-generation-inference, endpoints_compatible, region:us