Today's AI Summary

AI Developments: GPT-4o Synthetic Data, Generalized Humanoid Control, and More

Here's a look at some of the most interesting AI developments from today, focusing on image generation, robotics, and language models.

Research Highlights

  • Echo-4o: GPT-4o Synthetic Images for Improved Image Generation (arXiv:2508.09987): This paper explores the use of synthetic images generated by GPT-4o to improve open-source image generation models. The key insight is that synthetic images can complement real-world datasets by covering rare scenarios and providing cleaner supervision. The authors introduce Echo-4o-Image, a 180K-scale synthetic dataset, and demonstrate its effectiveness in fine-tuning models like Bagel. They also propose new evaluation benchmarks, GenEval++ and Imagine-Bench, for more challenging assessments.
  • Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation (arXiv:2508.09960): This research addresses the challenge of creating human-like humanoid robots by introducing the Generalized Behavior Cloning (GBC) framework. GBC provides a unified solution for transferring human motion to different robot morphologies. It includes an adaptive data pipeline, a novel DAgger-MMPPO algorithm, and an open-source platform based on Isaac Lab. The framework is validated by training policies on multiple heterogeneous humanoids, demonstrating excellent performance and transferability.
  • VisCodex: Unified Multimodal Code Generation (arXiv:2508.09945): This paper introduces VisCodex, a framework that merges vision and coding language models to enhance multimodal code generation. VisCodex uses a task vector-based model merging technique to integrate a coding LLM into a vision-language backbone. The authors also introduce the Multimodal Coding Dataset (MCD) and the InfiBench-V benchmark for evaluation. Experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o.
  • January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis (arXiv:2508.09966): This paper introduces the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. It also details a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically.

Model Updates

  • NextStep-1-f8ch16-Tokenizer: Stepfun-ai has released an improved image tokenizer for NextStep-1. The tokenizer features a decoder fine-tuned against a frozen encoder, yielding better performance and more robust reconstruction quality. Evaluation on ImageNet-1K 256x256 shows competitive PSNR and SSIM scores compared to other tokenizers.
  • Gemma-3-270M Models: Several new models based on Google's Gemma-3-270M have been released, including GGUF versions by ggml-org and lmstudio-community, and an MLX version by lmstudio-community. These models offer various quantization options and are designed for conversational AI and text generation tasks. The litert-community also released a Gemma-3-270M-it model, highlighting its multimodal capabilities and strong performance on various benchmarks.

Key Takeaways

  • Synthetic Data Value: Synthetic data, particularly from models like GPT-4o, is proving valuable for augmenting real-world datasets and improving model performance in specific areas.
  • Humanoid Robotics Advancements: The Generalized Behavior Cloning framework represents a significant step towards creating more versatile and human-like humanoid robots.
  • Multimodal Code Generation: VisCodex demonstrates the potential of merging vision and coding models for advanced multimodal code generation, bridging the gap between visual understanding and code creation.
  • Gemma-3 Ecosystem Growth: The release of multiple Gemma-3-270M variants highlights the growing ecosystem around this model family, offering developers a range of options for different hardware and use cases.

AI Papers for 2026-04-15

Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems

The stable operation of autonomous off-grid photovoltaic systems dictates reliance on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
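To make the gating mechanism in the abstract concrete, here is a minimal, hypothetical illustration (not the paper's Thermodynamic Alpha-Gate implementation): the network predicts an opacity factor in (0, 1), and the forecast is that factor times a deterministic clear-sky irradiance bound that is zero when the sun is below the horizon, so nocturnal output is structurally zero. The cosine clear-sky model, the I0 constant, and the sigmoid parameterization are illustrative assumptions.

import numpy as np

def clear_sky_irradiance(solar_zenith_deg, I0=1000.0):
    """Toy deterministic clear-sky upper bound: zero when the sun is below
    the horizon, cosine-scaled otherwise."""
    cos_z = np.cos(np.radians(solar_zenith_deg))
    return I0 * np.clip(cos_z, 0.0, None)      # exactly 0 at night

def alpha_gated_forecast(alpha_logits, solar_zenith_deg):
    """Multiplicative gate: learned atmospheric-opacity factor in (0, 1)
    times the clear-sky bound; nocturnal predictions are zero because the
    bound itself is zero."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logits))  # sigmoid -> (0, 1)
    return alpha * clear_sky_irradiance(solar_zenith_deg)

# Toy usage: a cloudy noon vs. midnight
print(alpha_gated_forecast(np.array([0.0, 3.0]), np.array([30.0, 120.0])))
# -> a positive daytime value, exactly 0.0 at night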

Detecting Safety Violations Across Many Agent Traces

To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.

Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems to be a promising strategy for implementing a budget-aware QA workflow for radiotherapy segmentation.
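For readers unfamiliar with the quantities involved, the sketch below is a generic illustration (not the paper's pipeline) of how voxel-wise predictive entropy can be computed from an ensemble of softmax outputs, and how a review budget of the top few percent most uncertain voxels would be selected; shapes and the 5% budget are assumptions.

import numpy as np

def predictive_entropy(ensemble_probs):
    """ensemble_probs: (n_members, n_classes, *voxel_dims) softmax outputs.
    Returns voxel-wise entropy of the mean predictive distribution."""
    mean_p = ensemble_probs.mean(axis=0)                   # (n_classes, *voxels)
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=0)  # (*voxels,)

def top_uncertain_mask(entropy, budget=0.05):
    """Boolean mask selecting the top `budget` fraction of most uncertain voxels."""
    threshold = np.quantile(entropy, 1.0 - budget)
    return entropy >= threshold

# Toy example: 5 ensemble members, 2 classes, a 4x4x4 volume
probs_fg = np.random.rand(5, 1, 4, 4, 4)
ensemble = np.concatenate([1.0 - probs_fg, probs_fg], axis=1)
H = predictive_entropy(ensemble)
mask = top_uncertain_mask(H, budget=0.05)
print(H.shape, int(mask.sum()), "voxels flagged for review")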

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Large language models (LLMs) are now capable of generating highly fluent text. While they offer significant convenience to humans, they also introduce risks such as phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

A Mechanistic Analysis of Looped Reasoning Language Models

Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.

GenTac: Generative Modeling and Forecasting of Soccer Tactics

Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

AI Models

JANGQ-AI/MiniMax-M2.7-JANGTQ


license: other
license_name: minimax-m2.7-non-commercial
license_link: LICENSE
library_name: mlx
tags:
  • mlx
  • jang
  • jangtq
  • minimax
  • moe
  • apple-silicon
  • 2bit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized

🔧 2026-04-14 · chat_template.jinja fix: re-download if cached

Earlier versions of this repo shipped a chat_template that unconditionally forced <think> reasoning mode, ignoring enable_thinking=False. Synced with JANG_2L: the new template respects the flag, so callers can now skip reasoning for fast direct answers.

If you downloaded this model before 2026-04-14, please re-download chat_template.jinja:

huggingface-cli download JANGQ-AI/MiniMax-M2.7-JANGTQ chat_template.jinja --local-dir /path/to/your/local/copy

Or pass tools-only prompts (tool calling works regardless). Model weights are unchanged.

<p align="center"> <img src="mlx-studio-logo.png" alt="MLX Studio" width="400"/> </p> <p align="center"> <img src="jangq-logo.png" alt="JANGQ" width="200"/> </p> <div align="center">

MiniMax-M2.7 JANGTQ

MiniMax M2.7 228B MoE: 2.15-bit codebook + Hadamard, 56.5 GB

The smallest, highest-quality MiniMax M2.7 on Apple Silicon.

</div>

โš ๏ธ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles thinking mode, and uses the custom Metal kernels this model needs. Stock mlx_lm.load() will NOT load this model โ€” see usage instructions below.

Follow development on Twitter: @jangq_ai


What is JANGTQ?

JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG quantization format. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime (no decompression to affine), and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 4-entry codebook, and accumulate dot products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).

Result: smaller than affine 2-bit, higher quality than affine 2-bit, runs at 89% of affine 2-bit speed on Apple Silicon.

| | JANG_2L (affine) | JANGTQ | Δ |
|---|---|---|---|
| Disk size | ~63 GB | 56.5 GB | −10% |
| GPU memory | ~62.6 GB | 56.5 GB | −10% |
| Avg bits/param | 2.10 | ~2.15 | +0.05 |
| MMLU (200q) | 88% | 91.5% | +3.5 pp |
| Decode speed (M3 Ultra) | 48-50 tok/s | 44.3 tok/s | ~89% of affine |

JANGTQ trades ~10% speed for ~10% disk savings AND a quality improvement. The 2-bit codebook learned via Lloyd-Max is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights, so the same bit budget reproduces the original weight matrix more faithfully.


MMLU Benchmark (200 questions, 10 subjects, reasoning ON)

Overall: 183/200 = 91.5%

Tested 2026-04-13 on Mac Studio M3 Ultra. Reasoning enabled (MiniMax M2.7 is an always-reasoning model); <think>…</think> stripped before scoring.

| Subject | JANGTQ | JANG_2L (affine) | JANG_3L/4M |
|---|---|---|---|
| astronomy | 20/20 (100%) | — | — |
| high_school_biology | 20/20 (100%) | — | — |
| abstract_algebra | 19/20 (95%) | — | — |
| college_computer_science | 19/20 (95%) | — | — |
| high_school_mathematics | 19/20 (95%) | — | — |
| college_physics | 18/20 (90%) | — | — |
| high_school_chemistry | 18/20 (90%) | — | — |
| anatomy | 17/20 (85%) | — | — |
| world_religions | 17/20 (85%) | — | — |
| logical_fallacies | 16/20 (80%) | — | — |
| Total | 183/200 = 91.5% | ~88% | ~95.5% |

JANGTQ sits cleanly between affine JANG_2L (88%) and the larger JANG_3L/4M (95.5%), capturing most of the quality of the 3L/4M profiles at ~55-60% of their disk footprint.

Speed Benchmarks (Mac Studio M3 Ultra)

| Prompt / max_tok | observed tok | tok/s |
|---|---|---|
| "Capital of France?" / 50 | 50 / 50 | 35.6 |
| "Capital of France?" / 150 | 66 / 150 | 37.5 |
| "Count 1-30" / 150 | 150 / 150 | 42.2 |
| "Photosynthesis 5 sent" / 300 | 300 / 300 | 44.5 |
| "Poem + 17×23" / 300 | 296 / 300 | 44.0 |
| MMLU average (200q, reasoning on) | — | 41.9 |

Steady-state (300-tok and longer): ~44.3 tok/s. Short prompts appear slower due to fixed prefill amortization.


Important Settings

MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens <think>\n at each assistant turn.

| Setting | Value | Notes |
|---------|-------|-------|
| Temperature | 1.0 | REQUIRED: temp=0 can cause thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |

Strip <think>…</think> from the response before using the final answer.
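A minimal way to do that in plain Python (a generic regex helper, not part of jang-tools; it assumes the response is a single string containing a closed <think> block, mirroring the split-based stripping in the Usage example below):

import re

def strip_reasoning(text: str) -> str:
    """Remove a leading <think>...</think> block and return the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, count=1, flags=re.DOTALL).strip()

print(strip_reasoning("<think>scratch work...</think>\nParis."))  # -> "Paris."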


Model Details

| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ |
| Format | JANGTQ (codebook + Hadamard); weight_format: mxtq in jang_config.json |
| Avg bits/param | ~2.15 |
| Disk | 56.55 GB |
| GPU active (loaded) | 56.50 GB |
| GPU peak (decoding) | 57-58 GB |
| Load time | ~10 s |
| Context | 192 K tokens |
| Chat template | Always-reasoning (<think>\n opened at assistant start) |

JANGTQ Bit Allocation

| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down), 98% of params | 2 | JANGTQ codebook + Hadamard | Sparsely activated (8 of 256 per token); the learned codebook on Hadamard-rotated rows reproduces the distribution better than uniform 2-bit affine |
| Attention (Q/K/V/O) | 8 | affine (nn.QuantizedLinear, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized nn.Linear | Routing precision matters; ~0.8M params, negligible size |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |

The routed experts account for 98% of the parameters and are the natural compression target. JANGTQ pushes them to 2-bit with a codebook-learned quantizer and a random Hadamard rotation. Everything else stays at 8-bit affine so the quality-critical hot path (attention + embed + shared expert) runs at higher precision.


Usage

This model requires the jang-tools loader; stock mlx_lm.load() does NOT recognize weight_format: mxtq and will reject the model. The loader applies Metal kernel monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block Hadamard, router compile, QKV fusion, thread-tiling OPT=10/20).

pip install jang-tools
# Or from source: git clone https://github.com/JANGQ-AI/jang-tools
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)

# Strip reasoning to get the final answer
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)

On first load you'll see log lines like:

Loading JANGTQ: MiniMax-M2.7-JANGTQ
  seed=42, bits_map={'attention': 8, ..., 'routed_expert': 2, ...}
  61 shards
  TQ groups: 47616, regular: 1123
  Replaced 186 modules with TurboQuantLinear
  Patched SwitchGLU class for fused gate+up (62 TQ instances)
  P15 mx.compile(router) applied to 1 MoE class(es)
  P18 QKV fusion: 1 class(es), 62 instances
  Done

That's all four classes of optimizations (P3/P15/P17/P18) engaging. Expected decode: ~44 tok/s on M3 Ultra, ~35-40 tok/s on M4 Max, ~25-30 tok/s on M4 Pro.

Minimum Hardware

| GPU | Min RAM | Notes |
|---|---|---|
| M3 Ultra / M2 Ultra | 96 GB | Tested on 256 GB, 44 tok/s |
| M4 Max | 96 GB | Expected ~35-40 tok/s |
| M4 Pro | 64 GB | Very tight; expect ~25-30 tok/s |
| M3 Max / M2 Max | 96 GB | Expected ~30-35 tok/s |

56.5 GB of GPU memory is needed just for the weights; add 2-5 GB for KV cache and intermediate activations, plus enough system memory for the OS + other processes.

Why JANG for MiniMax

Standard MLX uniform quantization on MiniMax produces completely broken output at every bit level: MMLU drops to ~25% (random guessing) because the MoE router becomes unreliable. JANG's mixed-precision approach (attention + router at full precision, routed experts at 2-bit) is the only working quantized MiniMax on Apple Silicon.

JANGTQ takes this one step further by using a learned codebook for the 2-bit expert weights. For MiniMax M2.5, JANG_2L (affine) scored 74% MMLU vs MLX's 25%. For MiniMax M2.7, JANGTQ scores 91.5%, the highest-quality sub-60-GB MiniMax quant on any runtime.


Compression Math

Quantization (offline, per weight matrix):
  w_rot[r, i]  = (H ⊙ signs * w^T)[r, i]           # randomized Hadamard rotation
  norms[r]     = ||w_rot[r, :]||₂
  packed[r, i] = argmin_c ||w_rot[r, i]/norms[r] - codebook[c]||   # Lloyd-Max 2-bit

Inference (runtime):
  x_rot   = H ⊙ (signs * x)                         # O(d log d) rotation
  y[b, r] = norms[r] · Σᵢ x_rot[b, i] · codebook[unpack(packed[r, i])]

The Hadamard rotation flattens the heavy tail of the weight distribution, so a 4-entry codebook (2-bit) captures it with minimal error. The rotation is symmetric (H @ H = I), so rotating the input once at runtime is mathematically equivalent to rotating every weight once at quantization time.

Credit: QuIP# for the rotate-input-once insight.
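As a concrete (if simplified) illustration of the math above, here is a NumPy sketch of the offline quantization and the runtime rotate-input-once matmul for a single weight matrix. It is a minimal reference, not the JANGTQ Metal kernel: the codebook fit and packing are written for clarity rather than speed, scipy.linalg.hadamard and power-of-two dimensions are assumptions, and group-wise norms plus uint32 packing are omitted.

import numpy as np
from scipy.linalg import hadamard          # assumes the hidden dim is a power of two

def fit_codebook(values, k=4, iters=25):
    """1-D Lloyd-Max (k-means) codebook for the normalized rotated weights."""
    centers = np.quantile(values, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        idx = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centers[c] = values[idx == c].mean()
    return centers

def quantize(w, seed=0):
    """Offline: rotate rows with a signed Hadamard, store per-row norms,
    the 4-entry codebook, and 2-bit codebook indices."""
    d = w.shape[1]
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=d)
    H = hadamard(d) / np.sqrt(d)            # orthonormal and symmetric, H @ H == I
    w_rot = (w * signs) @ H                 # rotate each weight row once
    norms = np.linalg.norm(w_rot, axis=1, keepdims=True)
    codebook = fit_codebook((w_rot / norms).ravel())
    idx = np.abs((w_rot / norms)[..., None] - codebook).argmin(axis=-1)
    return signs, norms, codebook, idx.astype(np.uint8)

def matmul(x, signs, norms, codebook, idx):
    """Runtime: rotate the input once, then dot against codebook lookups."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)
    x_rot = (x * signs) @ H
    return x_rot @ (norms * codebook[idx]).T

w = np.random.randn(8, 64)
x = np.random.randn(2, 64)
print(np.abs(x @ w.T - matmul(x, *quantize(w))).mean())  # error of the 2-bit approximation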


Known Behaviors / Settings

  • Always-reasoning: chat template opens <think>\n at assistant start. Give it max_tokens ≥ 8192 in benchmarks.
  • Stop token: single EOS [e~[ = id 200020. mlx_lm reads this correctly from generation_config.json.
  • Temperature 1.0 required: greedy/temp=0 can cause the reasoning to get stuck in a loop. Top-p 0.95 + top-k 40 recommended.
  • GPU RAM: 56.5 GB base + KV cache grows with conversation length. Budget 60-65 GB for typical use, more for very long contexts.

Created by Jinho Jang (eric@jangq.ai), part of the JANG collection.

Base model: MiniMaxAI/MiniMax-M2.7. Quantization method: JANGTQ (codebook + randomized Hadamard, see math above). License: follows the upstream MiniMax open license.

Author: JANGQ-AI

Likes: 23

Downloads: 0

Tags: mlx, safetensors, minimax_m2, jang, jangtq, minimax, moe, apple-silicon, 2bit, text-generation, conversational, custom_code, arxiv:2402.04396, base_model:MiniMaxAI/MiniMax-M2.7, base_model:quantized:MiniMaxAI/MiniMax-M2.7, license:other, region:us

Motif-Technologies/Motif-Video-2B


license: apache-2.0
language:
  • en
tags:
  • text-to-video
  • image-to-video
  • video-generation
  • diffusion-transformer
pipeline_tag: text-to-video
library_name: diffusers

<!-- Cover banner: replace with the qualitative teaser strip used as Figure 1 in the technical report (multi-prompt frame grid). Wan / HunyuanVideo style: full-width image directly under the title. --> <p align="center"> <img src="assets/banner.png" width="100%" alt="Motif-Video 2B teaser"/> </p> <p align="center"> <h1 align="center">Motif-Video 2B</h1> </p> <p align="center"> <b>A micro-budget text-to-video diffusion transformer from Motif Technologies</b> </p> <p align="center"> 📑 <a href="">Technical Report</a> &nbsp;|&nbsp; 🤗 <a href="">Hugging Face</a> &nbsp;|&nbsp; 🌐 <a href="https://motiftech.io/videoshowcase">Project Page</a> (not updated) </p>

🔥 News

  • [2026-04-14] We release Motif-Video 2B, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full technical report.

📖 Introduction

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget (fewer than 10M training clips and under 100,000 H200 GPU hours) and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.

Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:

  • Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
  • Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.

These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher; to our knowledge, this is the first time this combination has been applied to text-to-video training.

On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among open-source models we evaluate, surpassing Wan2.1-14B with 7× fewer parameters and roughly an order of magnitude less training data.

<!-- Architecture figure โ€” replace with Figure 2 from the technical report (the three-stage backbone + Shared Cross-Attention diagram). --> <p align="center"> <img src="assets/architecture.png" width="90%" alt="Motif-Video 2B architecture"/> </p>

✨ Highlights

  • Two tasks, one set of weights. A single checkpoint handles both text-to-video (T2V) and image-to-video (I2V) generation, trained jointly without a learnable task-type embedding.
  • Up to 720p, 121 frames. The final model generates 720p video at 121 frames under the standard rectified flow-matching sampler.
  • Architectural specialization over brute-force scale. Three-stage backbone with role-separated dual-stream / single-stream / DDT decoder layers.
  • Shared Cross-Attention. Stabilizes text alignment under long video-token sequences by grounding cross-attention K/V in the self-attention manifold.
  • Micro-budget recipe. TREAD token routing (≈27% per-step FLOP reduction) + early-phase REPA with V-JEPA teacher + offline bucket-balanced sampler (≈90% data utilization, up from ≈20% baseline).
  • Open and reproducible. Trained on ~64× H200 GPUs with FSDP2; full curriculum and recipe documented in the technical report.

๐Ÿ—๏ธ Architecture

Motif-Video 2B is a flow-matching diffusion transformer organized around a single principle: each component is assigned a well-defined responsibility, and components with conflicting objectives are not asked to share capacity.

| Component | Choice |
|---|---|
| Text encoder | T5Gemma2 (encoder–decoder, UL2-adapted Gemma 3) |
| Video tokenizer | Wan2.1 VAE (8×8 spatial, 4× temporal compression), 2×2×1 patchify |
| Backbone | 12 dual-stream + 16 single-stream + 8 DDT decoder layers |
| Hidden dim / heads | 1536 / 12 heads × 128 |
| Normalization | QK-normalization throughout |
| Position encoding | RoPE |
| Cross-attention | Shared Cross-Attention in the single-stream stage |
| Objective | Rectified flow matching (velocity prediction) |
| I2V conditioning | First-frame latent + SigLIP image embeddings, with timestep-aware blur |

A high-level walkthrough of the role separation:

  1. Dual-stream stage (12 layers). Text and video tokens are processed through separate self-attention pathways, exchanging information via cross-attention. This prevents premature feature entanglement before either modality has formed coherent representations.
  2. Single-stream stage (16 layers). Text and video tokens attend freely in a joint sequence. Shared Cross-Attention is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
  3. DDT decoder (8 layers). A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.

For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the technical report.
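To make the K/V-sharing idea concrete, here is a hedged PyTorch sketch of a residual cross-attention block in the spirit described above: the text keys and values come from the same projection weights the self-attention path uses (rather than fresh cross-attention K/V projections), a separate query projection is kept, and the output projection is zero-initialized so the branch starts as an identity-preserving residual. The class name, shapes, and the single-head simplification are our assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCrossAttention(nn.Module):
    """Residual cross-attention that reuses the self-attention K/V projections
    for the text tokens (single-head sketch; the real model is multi-head)."""

    def __init__(self, dim: int, self_attn_k: nn.Linear, self_attn_v: nn.Linear):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)          # cross-attention keeps its own Q
        self.k_proj = self_attn_k                  # shared with self-attention
        self.v_proj = self_attn_v                  # shared with self-attention
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)       # zero-init W_O: branch starts as a no-op
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, video_tokens, text_tokens):
        q = self.q_proj(video_tokens)              # (B, N_video, D)
        k = self.k_proj(text_tokens)               # (B, N_text, D)
        v = self.v_proj(text_tokens)
        attn = F.scaled_dot_product_attention(q, k, v)
        return video_tokens + self.out_proj(attn)  # residual update of the video stream

# Toy usage with stand-in self-attention projections
dim = 64
k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
block = SharedCrossAttention(dim, k_proj, v_proj)
video = torch.randn(1, 128, dim)
text = torch.randn(1, 16, dim)
print(block(video, text).shape)                    # torch.Size([1, 128, 64])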

<!-- Optional: insert Figure 3 (attention heatmaps across the three stages) here as a secondary architecture figure. It is the strongest visual evidence for the role-separation argument. -->

🚀 Quickstart / Usage

Requirements

  • Python 3.10+
  • CUDA-capable GPU with 24GB+ VRAM (e.g., A100, H100, RTX 4090)
pip install "diffusers>=0.35.2" "transformers>=5.0.0" torch accelerate ftfy einops sentencepiece regex Pillow

Text-to-Video (T2V)

import torch
from diffusers import AdaptiveProjectedGuidance, DiffusionPipeline
from diffusers.utils import export_to_video

guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
)

pipe = DiffusionPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    custom_pipeline="pipeline_motif_video",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A golden retriever running through a sunlit meadow, slow motion, cinematic lighting",
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
)

export_to_video(output.frames[0], "output.mp4", fps=24)

Image-to-Video (I2V)

import torch
from diffusers import AdaptiveProjectedGuidance, DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
)

pipe = DiffusionPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    custom_pipeline="pipeline_motif_video",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe = pipe.to("cuda")

image = Image.open("input.png").convert("RGB")

output = pipe(
    prompt="The subject begins to move naturally",
    image=image,
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
)

export_to_video(output.frames[0], "output.mp4", fps=24)

CLI Inference

# Text-to-Video
python inference.py \
  --prompt "A time-lapse of a flower blooming in a dark room, dramatic lighting" \
  --output t2v_output.mp4

# Image-to-Video
python inference.py \
  --image input.png \
  --prompt "The camera slowly pans around the subject" \
  --output i2v_output.mp4

See inference.py for all available options (--help).

Recommended Settings

| Parameter | Default | Notes |
|---|---|---|
| Resolution | 1280x736 | 720p, best quality |
| Frames | 121 | ~5 seconds at 24fps |
| Guidance scale | 8.0 | |
| Scheduler shift | 15.0 | Pre-configured in scheduler config |
| Inference steps | 50 | |
| dtype | bfloat16 | Recommended for H100/A100 |


📊 Performance

VBench

Motif-Video 2B achieves the highest Total Score among open-source models we evaluate.

| Model | Params | Total | Quality | Semantic |
|---|---|---|---|---|
| Wan2.2-T2V (prompt-opt.) | A14B | 84.23 | 85.42 | 79.50 |
| Motif-Video 2B (Ours) | 2B | 83.76 | 84.59 | 80.44 |
| SANA-Video | 2B | 83.71 | 84.35 | 81.35 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 |
| OpenSora 2.0 (T2I2V) | 11B | 83.60 | 84.40 | 80.30 |
| Wan2.1-T2V | 1.3B | 83.31 | 85.23 | 75.65 |
| HunyuanVideo | 13B | 83.24 | 85.09 | 75.82 |
| CogVideoX1.5-5B (prompt-opt.) | 5B | 82.17 | 82.78 | 79.76 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 |
| LTX-Video | 2B | 80.00 | 82.30 | 70.79 |

Notable per-dimension highlights for Motif-Video 2B (open-source):

  • Spatial Relationship: 83.02%, best among open-source models
  • Semantic Score: 80.44%, highest among open-source models reporting per-dimension results
  • Object Class: 92.93%, Multiple Objects: 77.29%, Imaging Quality: 70.50%, second-best in their categories

The full 16-dimension breakdown is in Table 3 of the technical report.

A note on VBench vs. perceptual quality. Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.

Human evaluation

In a blind pairwise study against six contemporaneous open-source baselines (SANA-Video, LTX-Video 2, Wan2.1-14B, Wan2.1-1.3B, Wan2.2-5B, CogVideoX-5B) on 40 LLM-generated prompts, Motif-Video 2B is preferred over both SANA-Video (similar parameter count) and Wan2.1-1.3B (similar parameter count, larger training corpus) on prompt-following and video-fidelity axes. Wan2.1-14B remains the preferred model overall, consistent with its 7× larger parameter count and substantially larger training data.


🎬 Showcase

<!-- Insert the qualitative grids from the technical report here: - Figure 1 / Figure 12: T2V multi-prompt frame strips - Figure 13: I2V example (input image + generated frames) Use full-width or 2-column layout, matching Wan2.1's "Showcase" section. -->

Text-to-Video

<p align="center"> <img src="assets/showcase_t2v.png" width="100%" alt="Motif-Video 2B T2V samples"/> </p>

Image-to-Video

<p align="center"> <img src="assets/showcase_i2v.png" width="100%" alt="Motif-Video 2B I2V samples"/> </p>

โš ๏ธ Limitations

We report limitations as the boundary conditions under which the design decisions in this report should be interpreted, not as caveats.

  • Micro-scale semantic distortion. Motif-Video 2B occasionally produces sub-object-level artifacts that leave the category label intact but break perceptual plausibility: distorted hands on close-up human subjects, degraded body structure under high-displacement motion, and attribute leakage between visually similar co-present subjects. We attribute these primarily to data coverage rather than backbone design.
  • Temporal failures. Three distinct modes that frame-level metrics do not surface: (i) physically implausible liquid / cloth / collision dynamics, (ii) coherence loss under high scene complexity (multi-agent crowds), and (iii) unintended mid-clip scene transitions in long sequences.
  • Recipe components are evaluated jointly, not in isolation. We do not present per-component ablations for Shared Cross-Attention, the DDT decoder, REPA phasing, or TREAD routing at full scale. Readers should interpret our results as evidence that the composed recipe works at 2B, not as a marginal-contribution claim about any single component.

We view temporal stability and data coverage, not architectural depth, as the primary remaining ceilings on this model. Both are the most natural axes for a future iteration that the current architecture is built to absorb.


📚 Citation

If you find Motif-Video 2B useful in your research, please cite:

@techreport{motifvideo2b2026,
  title  = {Motif-Video 2B: Technical Report},
  author = {Motif Technologies},
  year   = {2026},
  institution = {Motif Technologies},
  url    = {}
}

๐Ÿ™ Acknowledgements

We build on a number of excellent open-source projects, including the Wan2.1 VAE [Wan Team, 2025], T5Gemma / Gemma 3 [Google], TREAD [Krause et al., 2025], REPA with the V-JEPA family of visual encoders [Bardes et al.], DDT [Wang et al.], and the broader diffusers and Accelerate ecosystems. Compute was provisioned on Microsoft Azure and orchestrated with SkyPilot on Kubernetes.


📄 License

<!-- TODO: confirm final license; apache-2.0 placeholder above. -->

This model is released under the Apache 2.0 License. See LICENSE for details.

Author: Motif-Technologies

Likes: 9

Downloads: 0

Tags: diffusers, safetensors, text-to-video, image-to-video, video-generation, diffusion-transformer, en, license:apache-2.0, diffusers:MotifVideoPipeline, region:us

unsloth/ERNIE-Image-Turbo-GGUF


base_model: baidu/ERNIE-Image-Turbo
license: apache-2.0
pipeline_tag: text-to-image
library_name: ggml
tags:
  • text-to-image
  • gguf
  • unsloth
widget:
  • text: LLM teaching sloth
    output:
      url: assets/Ernie-Image-Turbo-GGUF.png

This is a GGUF quantized version of ERNIE-Image-Turbo. <br> unsloth/ERNIE-Image-Turbo-GGUF uses Unsloth Dynamic 2.0 methodology for SOTA performance.

  • Important layers are upcasted to higher precision.
  • Uses tooling from ComfyUI-GGUF by city96.
<div> <div style="display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> </div>

ERNIE-Image-Turbo

<p align="center"> <a href="https://huggingface.co/Baidu/ERNIE-Image">๐Ÿค— ERNIE-Image</a> &nbsp;|&nbsp; <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">๐Ÿค— ERNIE-Image-Turbo</a> &nbsp;|&nbsp; <a href="https://huggingface.co/spaces/baidu/ERNIE-Image">๐Ÿ–ฅ๏ธ Huggingface Demo</a> &nbsp;|&nbsp; <br/> <a href="https://aistudio.baidu.com/ernieimage">๐Ÿ–ฅ๏ธ AI Studio Demo</a> &nbsp;|&nbsp; <a href="https://yiyan.baidu.com/blog/posts/ernie-image">๐Ÿ“– Blog</a> &nbsp;|&nbsp; <a href="https://ernieimageprompt.com/">๐Ÿ–ผ๏ธ Art Gallery</a> <br/> <a href="https://github.com/baidu/ERNIE-Image/blob/main/assets/contacts/WeChat_small.jpg">๐Ÿ’ฌ WeChat(ๅพฎไฟก)</a> &nbsp;|&nbsp; <a href="https://discord.gg/ByUTbjfG5k">๐Ÿซจ Discord</a> &nbsp;|&nbsp; <a href="https://x.com/ErnieforDevs">๐Ÿท๏ธ X</a> </p>

ERNIE-Image-Turbo is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is the distilled release of ERNIE-Image, built on the same single-stream Diffusion Transformer (DiT) family and designed for fast generation with strong fidelity in only 8 inference steps. The model retains strong controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image-Turbo remains strong on complex instruction following, text rendering, and structured image generation, making it well suited for posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and efficiency. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and stylized aesthetic outputs.

<p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/QRt1mPSU9SCkcxxFWQje2.jpeg" alt="ERNIE-Image Mosaic" width="100%"> </p>

Highlights:

  • Fast and efficient: As the distilled checkpoint of ERNIE-Image, ERNIE-Image-Turbo delivers strong generation quality with only 8 inference steps, making it suitable for latency-sensitive applications.
  • Text rendering: ERNIE-Image-Turbo performs well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
  • Instruction following: The model is able to follow complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions with strong reliability.
  • Structured generation: ERNIE-Image-Turbo is effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
  • Style coverage: In addition to clean and readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
  • Practical deployment: Thanks to its compact size, ERNIE-Image-Turbo can run on consumer GPUs with 24 GB of VRAM, which lowers the barrier for research, downstream use, and model adaptation.

Released Versions

ERNIE-Image: Our SFT model, delivers stronger general-purpose capability and instruction fidelity in typically 50 inference steps.

ERNIE-Image-Turbo: Our Turbo model, optimized by DMD and RL, achieves faster speed and higher aesthetics in only 8 inference steps.

Benchmark

GENEval

| Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
|---|---:|---:|---:|---:|---:|---:|---:|
| ERNIE-Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |

OneIG-EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | 0.4810 | 0.2450 | 0.5780 |
| Seedream 4.5 | 0.8910 | 0.9980 | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | 0.8920 | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |

OneIG-ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | 0.8430 | 0.9830 | 0.3110 | 0.4610 | 0.2360 | 0.5670 |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | 0.2790 | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | 0.9880 | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |

LongTextBench

| Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
|---|---:|---:|---:|
| Seedream 4.5 | 0.9890 | 0.9873 | 0.9882 |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |

Quick Start

Recommended Parameters

  • Resolution:
    • 1024x1024
    • 848x1264
    • 1264x848
    • 768x1376
    • 896x1200
    • 1376x768
    • 1200x896
  • Guidance scale: 1.0
  • Inference steps: 8

Diffusers

Install the latest version of diffusers:

pip install git+https://github.com/huggingface/diffusers
import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effectโ€”visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=True # use prompt enhancer
).images[0]

image.save("output.png")

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git

Start the server:

sglang serve --model-path baidu/ERNIE-Image-Turbo

Send a generation request:

curl -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effectโ€”visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png

Author: unsloth

Likes: 7

Downloads: 0

Tags: ggml, gguf, text-to-image, unsloth, base_model:baidu/ERNIE-Image-Turbo, base_model:quantized:baidu/ERNIE-Image-Turbo, license:apache-2.0, region:us

RohitUltimate/Qwen3.5_VL_2B_12k


license: mit
language:
  • en
base_model:
  • Qwen/Qwen3.5-2B
  • Qwen/Qwen3.5-2B-Base
pipeline_tag: image-text-to-text
author: Rohit Dey
tags:
  • vLLM
  • unsloth
  • qwen
  • qwen3.5
  • qwen3.5-2B
  • reasoning
  • chain-of-thought
  • qlora
library_name: transformers

Model Card: RohitUltimate/Qwen3.5_VL_2B_12k

Description

This model is a fine-tuned vision-language model based on Qwen3.5-2B, optimized for image-text-to-text tasks with extended context length (12k tokens).

Compared to the base and standard fine-tuned variants, this model demonstrates improved performance on instruction-following and multimodal understanding, benefiting from higher-quality training data and better alignment.

It is designed to run efficiently on GPUs with under 8GB VRAM, enabling low-cost deployment without significant performance compromise.


vLLM Inference Pipeline

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. You can run this model using vLLM with the following Docker command:

docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN> \
  vllm/vllm-openai:latest \
  --model RohitUltimate/Qwen3.5_VL_2B_12k \
  --tokenizer Qwen/Qwen3.5-2B \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-model-len 12000

Discussion:

If you need more information, have suggestions, or face any issues while using this model, feel free to start a discussion.

Let's collaborate and make this community stronger.

Author: RohitUltimate

Likes: 7

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, vLLM, unsloth, qwen, qwen3.5, qwen3.5-2B, reasoning, chain-of-thought, qlora, conversational, en, base_model:Qwen/Qwen3.5-2B, base_model:finetune:Qwen/Qwen3.5-2B, license:mit, endpoints_compatible, region:us

Jiunsong/SuperGemma4-31b-abliterated-GGUF


license: gemma
library_name: gguf
pipeline_tag: text-generation
base_model: google/gemma-4-31b-it
base_model_relation: finetune
language:
  • en
  • ko
tags:
  • gemma
  • gemma4
  • gguf
  • llama.cpp
  • q4_k_m
  • uncensored
  • chat
  • coding
  • reasoning

SuperGemma4-31b-abliterated-GGUF

If this release helps you, support future drops on Ko-fi.

SuperGemma4-31b-abliterated-GGUF is the GGUF release of SuperGemma4-31b-abliterated, packaged for llama.cpp-compatible runtimes and built for people who want a fully uncensored, harder-hitting Gemma 4 31B experience on local hardware.

This release keeps the same product direction as the MLX version:

  • fully uncensored chat with fewer brakes
  • stronger coding and technical help
  • sharper reasoning and planning
  • better real-world usefulness for local users
  • a surprisingly lightweight-feeling 31B deployment path for GGUF users

What you get

  • GGUF quantized weights for local deployment
  • Gemma chat template alongside the model files
  • a straightforward path for llama.cpp, LM Studio, and other GGUF tooling

Why people will like it

This release was pushed toward the things end users notice immediately:

  • much more open uncensored conversation
  • stronger coding, debugging, and implementation help
  • more useful answers on practical prompts instead of generic filler
  • a local experience that feels sharper, more direct, and more builder-friendly
  • a 31B release that feels leaner and punchier than the label suggests

In short: this is the Gemma 4 31B local drop for people who want fewer brakes, more capability, and a more addictive day-to-day experience.

Recommended usage

Example with llama.cpp:

llama-cli -m SuperGemma4-31b-abliterated.Q4_K_M.gguf -p "Write a clean FastAPI CRUD example." -n 256

Included clean-output helper

This release includes supergemma_guard.py and supergemma_gguf_guarded_generate.py for app-layer cleanup when you want especially clean JSON or stricter final formatting.

Example:

python supergemma_gguf_guarded_generate.py \
  --model SuperGemma4-31b-abliterated.Q4_K_M.gguf \
  --chat-template-file chat_template.jinja \
  --prompt 'Return only valid JSON with keys "title" and "steps".'

Recommended behaviors (a minimal, hypothetical cleanup sketch follows this list):

  • require raw JSON for JSON-only endpoints
  • strip stray internal markers before presenting answers
  • keep structured-output prompts explicit and narrow
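
For illustration, here is a small app-layer helper in the spirit of the behaviors above. This is not the bundled supergemma_guard.py; the function name and logic are purely illustrative.

import json

def extract_first_json(raw: str) -> dict:
    """Strip stray markers and return the first valid JSON object found in a model reply."""
    # Drop common wrapper noise such as code fences before scanning.
    cleaned = raw.replace("```json", "").replace("```", "").strip()
    start = cleaned.find("{")
    while start != -1:
        depth = 0
        for end in range(start, len(cleaned)):
            if cleaned[end] == "{":
                depth += 1
            elif cleaned[end] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(cleaned[start:end + 1])
                    except json.JSONDecodeError:
                        break
        start = cleaned.find("{", start + 1)
    raise ValueError("no valid JSON object found in model output")

# Example: enforce a JSON-only contract before returning the reply to a caller.
reply = 'Sure, here it is: {"title": "Setup", "steps": ["install", "run"]}'
print(extract_first_json(reply))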

Support

If you want to support more uncensored local model releases, benchmarks, and packaging work:

Author: Jiunsong

Likes: 6

Downloads: 0

Tags: gguf, gemma, gemma4, llama.cpp, q4_k_m, uncensored, chat, coding, reasoning, text-generation, conversational, en, ko, license:gemma, endpoints_compatible, region:us

Comfy-Org/SAM3


license: other
license_name: sam-license
license_link: LICENSE

WIP, for PR: https://github.com/Comfy-Org/ComfyUI/pull/13408

Author: Comfy-Org

Likes: 5

Downloads: 0

Tags: license:other, region:us

tokinasin/llm-jp-4-8b-instruct-uncensored-ara


license: apache-2.0
language:
  • en
  • ja
programming_language:
  • C
  • C++
  • C#
  • Go
  • Java
  • JavaScript
  • Lua
  • PHP
  • Python
  • Ruby
  • Rust
  • Scala
  • TypeScript
pipeline_tag: text-generation
library_name: transformers
inference: false
tags:
  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara
base_model:
  • llm-jp/llm-jp-4-8b-instruct

This is a model decensored from llm-jp/llm-jp-4-8b-instruct using Arbitrary-Rank Ablation (ARA), the method proposed in Heretic PR #211.

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| start_layer_index | 1 |
| end_layer_index | 26 |
| preserve_good_behavior_weight | 0.9630 |
| steer_bad_behavior_weight | 0.0063 |
| overcorrect_relative_weight | 0.4803 |
| neighbor_count | 11 |

Performance

| Metric | This model | Original model (llm-jp/llm-jp-4-8b-instruct) |
| :----- | :--------: | :------------------------------------------: |
| KL divergence | 0.0781 | 0 (by definition) |
| Refusals | 5/100 | 99/100 |


llm-jp-4-8b-instruct

LLM-jp-4 is a series of large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics.

This repository provides llm-jp-4-8b-instruct. For an overview of the LLM-jp-4 models across different parameter sizes, please refer to:

Base models are trained with pre-training and mid-training only. Post-trained models are aligned using supervised fine-tuning (SFT) and direct preference optimization (DPO), without reinforcement learning.

[!NOTE] While the thinking variants are trained with both SFT and DPO, this instruct model is trained using SFT only, without DPO.

For practical usage examples and detailed instructions on how to use the models, please also refer to our cookbook.

Usage

Please refer to our cookbook for practical usage examples and detailed instructions on how to use the models.
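
While the cookbook covers the details, a minimal text-generation sketch with transformers is shown below. The chat template ships with the model (see the tokenizer note); the prompt and sampling settings here are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm-jp/llm-jp-4-8b-instruct"
# Use the tokenizer bundled with the model, not openai-harmony (see the tokenizer note below).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain in one paragraph what a large language model is."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))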

Model Details

  • Model type: Transformer-based Language Model
  • Architectures:

Dense model:

|Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters|Total parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|8B|32|4,096|32|65,536|805,306,368|7,784,894,464|8,590,200,832|

MoE model:

|Params|Layers|Hidden size|Heads|Routed Experts|Activated Experts|Context length|Embedding parameters|Non-embedding parameters|Activated parameters|Total parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|32B-A3B|32|2,560|40|128|8|65,536|503,316,480|31,635,712,512|3,827,476,992|32,139,028,992|

Tokenizer

The tokenizer of this model is a Unigram byte-fallback model built with huggingface/tokenizers. The vocabulary entries were converted from llm-jp-tokenizer v4.0. Please refer to the README.md of llm-jp-tokenizer for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).

[!NOTE] The chat template of this model is designed to be compatible with the OpenAI Harmony response format. However, the tokenizer differs from the one assumed by the openai-harmony library, and therefore direct tokenization with openai-harmony is not supported. For correct behavior, please use the tokenizer provided with this model. For detailed usage, please refer to our cookbook.

Training

Pre-training

This model is trained through a multi-stage pipeline consisting of pre-training and mid-training phases, using a total of 11.7T tokens.

[Figure: pre-training and mid-training overview]

The corpora used for pre-training and mid-training are publicly available at the following links:

[!NOTE] Although most of the corpora have been released, some portions are excluded from public release due to licensing constraints.

Post-training

We have fine-tuned the pre-trained checkpoint using SFT and further aligned it with DPO.

The datasets used for post-training are also publicly available at the following links:

Evaluation

llm-jp-judge

We evaluated the model on a variety of tasks using an LLM-as-a-Judge framework. The descriptions of each task are as follows.

  • MT-Bench (JA/EN): A benchmark for measuring multi-turn conversational task-solving ability.
  • AnswerCarefully: A benchmark for evaluating safety in Japanese. We used 336 questions from the v2.0 test set.
  • llm-jp-instructions: A set of human-created single-turn question-answer pairs. We used 400 questions from the test set.

We evaluated the models using gpt-5.4-2026-03-05.

[!NOTE] In earlier evaluations of the llm-jp-3 series, we used gpt-4o-2024-08-06. The newer evaluator gpt-5.4-2026-03-05 provides a stricter and more reliable assessment, which results in lower scores on benchmarks such as MT-Bench compared to those reported for the llm-jp-3 series.

The scores represent the average values obtained from three rounds of inference and evaluation. For more details, please refer to the code.

| Model Name | MT-Bench (JA) | MT-Bench (EN) | AnswerCarefully | llm-jp-instructions |
|:---|---:|---:|---:|---:|
| gpt-4o-2024-08-06 | 7.29 | 7.69 | 4.00 | 4.07 |
| gpt-5.4-2026-03-05 (reasoning_effort = low) | 8.87 | 8.76 | 4.38 | 4.79 |
| gpt-5.4-2026-03-05 (reasoning_effort = medium) | 8.87 | 8.89 | 4.43 | 4.82 |
| gpt-5.4-2026-03-05 (reasoning_effort = high) | 8.98 | 8.85 | 4.41 | 4.83 |
| gpt-oss-20b (reasoning_effort = low) | 7.21 | 7.95 | 3.39 | 3.08 |
| gpt-oss-20b (reasoning_effort = medium) | 7.33 | 7.85 | 3.55 | 3.16 |
| llm-jp-4-8b-thinking (reasoning_effort = low) | 7.23 | 7.54 | 3.58 | 3.50 |
| llm-jp-4-8b-thinking (reasoning_effort = medium) | 7.54 | 7.79 | 3.69 | 3.54 |
| llm-jp-4-32b-a3b-thinking (reasoning_effort = low) | 7.57 | 7.70 | 3.61 | 3.61 |
| llm-jp-4-32b-a3b-thinking (reasoning_effort = medium) | 7.82 | 7.86 | 3.70 | 3.61 |

Risks and Limitations

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Send Questions to

llm-jp(at)nii.ac.jp

License

Apache License, Version 2.0

Acknowledgement

To develop this model, we used the NINJAL Web Japanese Corpus (whole-NWJC) from the National Institute for Japanese Language and Linguistics (NINJAL).

Model Card Authors

The names are listed in alphabetical order.

Hirokazu Kiyomaru and Takashi Kodama.

Author: tokinasin

Likes: 5

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, heretic, uncensored, decensored, abliterated, ara, conversational, en, ja, base_model:llm-jp/llm-jp-4-8b-instruct, base_model:finetune:llm-jp/llm-jp-4-8b-instruct, license:apache-2.0, text-generation-inference, region:us

Youssofal/MiniMax-M2.7-Abliterated-Heretic-MLX


base_model: MiniMaxAI/MiniMax-M2.7
library_name: mlx
pipeline_tag: text-generation
license: other
license_name: non-commercial
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
tags:
  • mlx
  • mlx-lm
  • minimax
  • minimax_m2
  • moe
  • mixture-of-experts
  • abliterated
  • uncensored
  • heretic
  • ara
quantized_by: Youssofal

MiniMax-M2.7-Abliterated-Heretic-MLX

This is an MLX release of an abliterated version of MiniMaxAI's MiniMax-M2.7.

By applying Heretic's Ablated Refusal Adaptation (ARA), the base refusal behavior was removed at the weight level. The result keeps MiniMax-M2.7's sparse MoE reasoning, long-context instruction following, and general capability profile, but no longer defaults to the original refusal pattern.

Methodology & Model Notes

MiniMax-M2.7 is a 229B sparse MoE model with 10B active parameters per token, 62 layers, hybrid attention, 256 local experts with 8 active per token, and a 200K context window.

This release was produced with a direct Heretic ARA run using the fixed parameter set below:

  • start_layer_index = 30
  • end_layer_index = 51
  • preserve_good_behavior_weight = 0.4512
  • steer_bad_behavior_weight = 0.0037
  • overcorrect_relative_weight = 0.8804
  • neighbor_count = 14

The direct ARA run completed with Refusals: 0/25.

The resulting abliterated checkpoint was exported to BF16 and then converted into Apple MLX variants for local deployment on Apple Silicon hardware.
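
For reference, conversions of this kind are typically produced with mlx-lm's converter. A minimal sketch follows; the paths, bit width, and group size are illustrative, and the card does not spell out the exact per-layer recipes used for the mixed builds.

# Illustrative only: quantize a BF16 Hugging Face checkpoint into an MLX build.
python -m mlx_lm.convert \
  --hf-path ./MiniMax-M2.7-Abliterated-BF16 \
  --mlx-path ./MiniMax-M2.7-Abliterated-MLX-4bit \
  -q --q-bits 4 --q-group-size 64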

Files

  • MiniMax-M2.7-Abliterated-MLX-3bit-mixed_3_4/: Layer-aware 3-bit MLX build with 4-bit treatment on sensitive modules
  • MiniMax-M2.7-Abliterated-MLX-4bit-mixed_4_5/: Layer-aware 4-bit MLX build with 5-bit treatment on sensitive modules
  • MiniMax-M2.7-Abliterated-MLX-5bit-tuned/: 5-bit MLX build with tighter grouping on sensitive modules
  • MiniMax-M2.7-Abliterated-MLX-6bit-tuned/: 6-bit MLX build with tighter grouping on sensitive modules

Only variants that pass the local MLX validation suite should remain published in this repo.

Prompt Format

]~!b[]~b]system
{system_prompt}[e~[
]~b]user
{prompt}[e~[
]~b]ai
<think>

Running

from mlx_lm import load, generate

model, tokenizer = load("Youssofal/MiniMax-M2.7-Abliterated-Heretic-MLX")

messages = [{"role": "user", "content": "Write a short Python function that reverses a string."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)

Validation

Every published variant is intended to pass:

  • the official 20-prompt refusal check from mlabonne/harmful_behaviors
  • a local coherence smoke check for normal conversation, reasoning, and code output

Model Architecture

| Spec | Value |
|---|---|
| Total Parameters | 229B (sparse MoE) |
| Active Parameters | 10B per token |
| Experts | 256 local, 8 per token |
| Layers | 62 |
| Attention | Hybrid: 7 Lightning + 1 softmax per 8-block |
| Context | 200K |
| Base Model | MiniMaxAI/MiniMax-M2.7 |

Disclaimer

This model has had refusal behavior removed at the weight level. It will answer prompts that the base model would normally refuse. You are responsible for how you use it.

Credits

License

This release inherits the base MiniMax-M2.7 license.

NON-COMMERCIAL. Commercial use requires written authorization from MiniMax.

Author: Youssofal

Likes: 4

Downloads: 0

Tags: mlx, mlx-lm, minimax, minimax_m2, moe, mixture-of-experts, abliterated, uncensored, heretic, ara, text-generation, base_model:MiniMaxAI/MiniMax-M2.7, base_model:finetune:MiniMaxAI/MiniMax-M2.7, license:other, region:us

loay/English-Document-OCR-Qwen3.5-0.8B


language:
  • en
  • de
  • tr
  • fr
  • es
  • it
  • pt
  • nl
  • pl
  • sv
  • da
  • 'no'
  • fi
  • cs
  • hu
  • ro
  • vi
  • id
  • ms
  • sw
  • tl
  • zu
  • hr
  • sk
  • sl
  • et
  • lv
  • lt
  • mt
  • is
  • sq
license: cc-by-nc-sa-4.0
base_model:
  • unsloth/Qwen3.5-0.8B
  • Qwen/Qwen3.5-0.8B
tags:
  • ocr
  • vision-language-model
  • document-understanding
  • qwen3
  • qwen3.5
  • gguf
  • english
  • archival
  • text-extraction
  • complex-layout
pipeline_tag: image-text-to-text

English-Document-OCR-Qwen3.5-0.8B

I built this model as part of my ongoing work in document digitization and archival OCR. My goal was to create a small, locally-runnable model that punches above its weight class. This 0.8B release keeps the same overall direction as my earlier Qwen OCR work, but uses improved dataset samples, stronger formatting targets, and better layout preservation. In practice, it produces better output than my previous 2B model on the kinds of structured document OCR I care about most.

This is an updated, smaller release focused on English archival and document OCR. If you try it on your documents, I'd love to hear how it performs; feel free to leave feedback in the Community tab.

License: This model is intended for personal and research use only. If you want to use this model in a product or service, or need to process documents commercially, contact ocr@loay.net.


Model Details

  • Fine-tuned by: loay
  • Base Model: unsloth/Qwen3.5-0.8B
  • Task: Document OCR
  • Training Data: Improved synthetic English document samples following the same family of dataset styles as my earlier Qwen OCR releases, including faded ink, bleed-through artifacts, skewed layouts, historical serif typefaces, charts, figures, formulas, and more challenging structured page compositions
  • Output Format: A markdown-first transcription format that preserves paragraph flow and layout structure, uses HTML for tables, uses LaTeX for formulas, emits [image] tags for figures/images, and emits [chart: ...] tags when extracting chart content
  • Language Support: This release is optimized for English. I'm planning to release versions for additional languages soon, including support for right-to-left document OCR. See my OCR finetuned models for future updates and related releases.

Usage

The model does not require a specific prompt. It will perform OCR on any document image by default. To achieve the best results and prevent conversational hallucinations, use the exact instruction the model was fine-tuned on:

Extract all visible text from this document image and return only the transcription in reading order using a markdown-first format. Use HTML only for tables. Use LaTeX only for formulas.


GGUF & Local Inference

Quantized GGUF files are available for use with llama.cpp, LM Studio, Ollama, and similar runtimes.

You must load mmproj-english-document-ocr-qwen3.5-0.8b-f16.gguf alongside your chosen weight file. Without the multimodal projector, the model cannot process images.

| File | Use Case |
|------|----------|
| english-document-ocr-qwen3.5-0.8b-f16.gguf | Full precision, maximum accuracy |
| english-document-ocr-qwen3.5-0.8b-q8_0.gguf | Best quality/size tradeoff for OCR precision |
| english-document-ocr-qwen3.5-0.8b-q6_k.gguf | High quality, lower VRAM |
| english-document-ocr-qwen3.5-0.8b-q5_k_m.gguf | Balanced quality and speed |
| english-document-ocr-qwen3.5-0.8b-q4_k_m.gguf | Fast, efficient local inference |
| mmproj-english-document-ocr-qwen3.5-0.8b-f16.gguf | Required multimodal projector (load with any weight above) |

Example with llama.cpp:

llama-cli \
  --model english-document-ocr-qwen3.5-0.8b-q4_k_m.gguf \
  --mmproj mmproj-english-document-ocr-qwen3.5-0.8b-f16.gguf \
  --image your_document.jpg
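
To pass the recommended instruction from the Usage section directly instead of prompting interactively, the same invocation can take a prompt via -p. This is a minimal sketch and assumes your llama.cpp build accepts -p together with the multimodal flags shown above.

llama-cli \
  --model english-document-ocr-qwen3.5-0.8b-q4_k_m.gguf \
  --mmproj mmproj-english-document-ocr-qwen3.5-0.8b-f16.gguf \
  --image your_document.jpg \
  -p "Extract all visible text from this document image and return only the transcription in reading order using a markdown-first format. Use HTML only for tables. Use LaTeX only for formulas." \
  -n 2048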

Limitations

  • Trained exclusively on synthetic data. May degrade on severe real-world scan artifacts outside the training distribution.
  • No handwriting support; the model relies on the base model's zero-shot ability for cursive or marginalia.
  • Trained to represent figures and embedded images with [image] tags and to extract chart content using [chart: ...] tags, but performance on complex real-world charts and scientific figures can still be inconsistent.
  • Supports LaTeX-style formula output as used in the training pipeline, but difficult mathematical layouts may still degrade on dense or low-quality scans.
  • Optimized for left-to-right Latin scripts. For Arabic/RTL documents, see my OCR models.
  • May hallucinate or break on very long context from dense pages. If your document is text-heavy, consider splitting it into sections before inference.

Author: loay

Likes: 4

Downloads: 0

Tags: gguf, ocr, vision-language-model, document-understanding, qwen3, qwen3.5, english, archival, text-extraction, complex-layout, image-text-to-text, en, de, tr, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, vi, id, ms, sw, tl, zu, hr, sk, sl, et, lv, lt, mt, is, sq, base_model:Qwen/Qwen3.5-0.8B, base_model:quantized:Qwen/Qwen3.5-0.8B, license:cc-by-nc-sa-4.0, endpoints_compatible, region:us, conversational

unsloth/ERNIE-Image-GGUF


base_model: baidu/ERNIE-Image
license: apache-2.0
pipeline_tag: text-to-image
library_name: ggml
tags:
  • text-to-image
  • gguf
  • unsloth
widget:
  • text: LLM teaching sloth
    output:
      url: assets/Ernie-Image-GGUF.png


This is a GGUF quantized version of ERNIE-Image. unsloth/ERNIE-Image-GGUF uses the Unsloth Dynamic 2.0 methodology for SOTA performance.

  • Important layers are upcasted to higher precision.
  • Uses tooling from ComfyUI-GGUF by city96.
More details: Unsloth on GitHub (https://github.com/unslothai/unsloth/), the Unsloth Discord (https://discord.gg/unsloth), and the Unsloth Dynamic 2.0 GGUF documentation (https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs).

ERNIE-Image

<p align="center"> <a href="https://huggingface.co/Baidu/ERNIE-Image">๐Ÿค— ERNIE-Image</a> &nbsp;|&nbsp; <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">๐Ÿค— ERNIE-Image-Turbo</a> &nbsp;|&nbsp; <a href="https://huggingface.co/spaces/baidu/ERNIE-Image-Turbo">๐Ÿ–ฅ๏ธ Huggingface Demo</a> &nbsp;|&nbsp; <br/> <a href="https://aistudio.baidu.com/ernieimage">๐Ÿ–ฅ๏ธ AI Studio Demo</a> &nbsp;|&nbsp; <a href="https://yiyan.baidu.com/blog/posts/ernie-image">๐Ÿ“– Blog</a> &nbsp;|&nbsp; <a href="https://ernieimageprompt.com/">๐Ÿ–ผ๏ธ Art Gallery</a> <br/> <a href="https://github.com/baidu/ERNIE-Image/blob/main/assets/contacts/WeChat_small.jpg">๐Ÿ’ฌ WeChat(ๅพฎไฟก)</a> &nbsp;|&nbsp; <a href="https://discord.gg/ByUTbjfG5k">๐Ÿซจ Discord</a> &nbsp;|&nbsp; <a href="https://x.com/ErnieforDevs">๐Ÿท๏ธ X</a> </p>

ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer that expands brief user inputs into richer structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality, but also for controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.

<p align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/QRt1mPSU9SCkcxxFWQje2.jpeg" alt="ERNIE-Image Mosaic" width="100%"> </p>

Highlights:

  • Compact but strong: Despite its compact 8B scale, ERNIE-Image remains highly competitive with substantially larger open-weight models across a range of benchmarks.
  • Text rendering: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
  • Instruction following: The model is able to follow complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions with strong reliability.
  • Structured generation: ERNIE-Image is especially effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
  • Style coverage: In addition to clean and readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
  • Practical deployment: Thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24 GB of VRAM, which lowers the barrier for research, downstream use, and model adaptation.

Released Versions

ERNIE-Image: Our SFT model; it delivers stronger general-purpose capability and instruction fidelity, typically at 50 inference steps.

ERNIE-Image-Turbo: Our Turbo model, optimized with DMD and RL; it generates faster and with higher aesthetic quality in only 8 inference steps.

Benchmark

GenEval

| Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
|---|---:|---:|---:|---:|---:|---:|---:|
| ERNIE-Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |

OneIG-EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | 0.4810 | 0.2450 | 0.5780 |
| Seedream 4.5 | 0.8910 | 0.9980 | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | 0.8920 | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |

OneIG-ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | 0.8430 | 0.9830 | 0.3110 | 0.4610 | 0.2360 | 0.5670 |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | 0.2790 | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | 0.9880 | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |

LongTextBench

| Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
|---|---:|---:|---:|
| Seedream 4.5 | 0.9890 | 0.9873 | 0.9882 |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |

Quick Start

Recommended Parameters

  • Resolution:
    • 1024x1024
    • 848x1264
    • 1264x848
    • 768x1376
    • 896x1200
    • 1376x768
    • 1200x896
  • Guidance scale: 4.0
  • Inference steps: 50

Diffusers

pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effectโ€”visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True # use prompt enhancer
).images[0]

image.save("output.png")

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git
cd sglang && pip install -e "python[all]"

Start the server:

sglang serve --model-path baidu/ERNIE-Image

Send a generation request:

curl -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effectโ€”visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true

  }' \
  --output output.png

Author: unsloth

Likes: 3

Downloads: 0

Tags: ggml, gguf, text-to-image, unsloth, base_model:baidu/ERNIE-Image, base_model:quantized:baidu/ERNIE-Image, license:apache-2.0, region:us