Today's AI Summary

AI Developments: Real-Time TTS, Enhanced Watermarking, and More

Today's AI landscape features advancements in text-to-speech, image generation, and information extraction, alongside research into model training and security.

Noteworthy Research Papers

  • SkillFactory: Self-Distillation For Learning Cognitive Behaviors (arXiv:2512.04072): This paper introduces SkillFactory, a method for fine-tuning models to learn cognitive skills like answer verification and backtracking during supervised fine-tuning (SFT) before reinforcement learning (RL). The approach uses samples from the model itself, rearranged to provide training data. Results show that SkillFactory helps models generalize to harder tasks post-RL and improves robustness to out-of-domain tasks.
  • MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking (arXiv:2512.04044): This paper presents MarkTune, a fine-tuning framework that improves the quality-detectability trade-off in watermarking open-weight language models. MarkTune treats the GaussMark signal as a reward while regularizing against degradation in text quality, leading to more robust and high-quality watermarks.
  • PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation (arXiv:2512.04025): This paper introduces Pyramid Sparse Attention (PSA), a module applicable to both video understanding and generation tasks. PSA introduces multi-level pooled KV representations, enabling finer mask granularity. PSA preserves contextual information and visual fidelity, consistently matching or outperforming existing sparse attention baselines while offering a superior efficiency-quality trade-off.

New Models

  • microsoft/VibeVoice-Realtime-0.5B: This real-time text-to-speech model supports streaming text input and long-form speech generation. With only 0.5B parameters, it achieves initial audible speech in approximately 300 ms. It achieves a Word Error Rate of 2.00% on the LibriSpeech test-clean set and 2.05% on the SEED test-en set, with speaker similarity scores of 0.695 and 0.633, respectively.
  • ostris/Z-Image-De-Turbo: This model is a de-distilled version of Tongyi-MAI/Z-Image-Turbo, fine-tuned to break down the turbo distillation. It can be used to train LoRAs or for continued fine-tuning.
  • fastino/gliner2-multi-v1: GLiNER2 extends the original GLiNER architecture to support multi-task information extraction with a schema-driven interface. This base model provides efficient CPU-based inference while maintaining high accuracy across diverse extraction tasks.
  • bartowski/NousResearch_Hermes-4.3-36B-GGUF: This model provides Llamacpp imatrix Quantizations of Hermes-4.3-36B by NousResearch.
  • QuantTrio/DeepSeek-V3.2-Speciale-AWQ: This is an AWQ quantized version of the DeepSeek-V3.2-Speciale model, designed for efficient reasoning and agentic AI tasks.
  • huihui-ai/Huihui-Ministral-3-3B-Instruct-2512-abliterated: This is an uncensored version of mistralai/Ministral-3-3B-Instruct-2512 created with abliteration.
  • crusadersAI/agribot-maize-diagnosis: AgriBot is a machine learning model for diagnosing maize (corn) leaf diseases. The model uses MobileNetV2 as a feature extractor combined with Logistic Regression for classification, achieving 94% accuracy.
  • ritessshhh/FinDeBERTa: FinDeBERTa is a fine-tuned DeBERTa-v3-Large model for multi-label financial event classification. It predicts one or more event types from financial news headlines with state-of-the-art performance.
  • juppy44/plant-identification-2m-vit-b: This model is a ViT-Base model fine-tuned on 2 million curated plant occurrences for fast species-level classification from a single photo.

Key Takeaways

  • Cognitive Skill Learning: SkillFactory offers a method to enhance models' cognitive abilities through self-distillation during SFT, improving generalization and robustness.
  • Watermarking for Open-Weight Models: MarkTune provides a way to embed robust, high-quality watermarks in open-weight LLMs while preserving text quality, improving the quality-detectability trade-off.

AI Papers for 2026-02-13

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
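
To make the idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a timestep-conditioned reward head with a noise-calibrated Thurstone pairwise loss; the pooled-latent inputs and the linear noise schedule are illustrative assumptions.

import torch
import torch.nn as nn

class LatentRewardHead(nn.Module):
    """Illustrative timestep-conditioned reward head over pooled diffusion latents."""
    def __init__(self, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )
        # Learned scale for the noise-dependent uncertainty (assumption, not the paper's form).
        self.log_sigma_slope = nn.Parameter(torch.zeros(()))

    def reward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z: (B, D) pooled noisy latent, t: (B,) normalized diffusion timestep in [0, 1]
        return self.mlp(torch.cat([z, t.unsqueeze(-1)], dim=-1)).squeeze(-1)

    def thurstone_loss(self, z_win, z_lose, t):
        """-log P(winner > loser) under a Thurstone model with noise-dependent sigma(t)."""
        r_w, r_l = self.reward(z_win, t), self.reward(z_lose, t)
        sigma = torch.exp(self.log_sigma_slope) * (1.0 + t)  # uncertainty grows with noise level
        normal = torch.distributions.Normal(0.0, 1.0)
        p_win = normal.cdf((r_w - r_l) / sigma)
        return -p_win.clamp_min(1e-6).log().mean()

At inference, rewards computed at several noise levels can be averaged, which is the flavor of test-time noise ensembling the abstract describes.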

GENIUS: Generative Fluid Intelligence Evaluation Suite

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks $\textit{Generative Fluid Intelligence (GFI)}$: the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce $\textbf{GENIUS}$ ($\textbf{GEN}$ Fluid $\textbf{I}$ntelligence Eval$\textbf{U}$ation $\textbf{S}$uite). We formalize $\textit{GFI}$ as a synthesis of three primitives. These include $\textit{Inducing Implicit Patterns}$ (e.g., inferring personalized visual preferences), $\textit{Executing Ad-hoc Constraints}$ (e.g., visualizing abstract metaphors), and $\textit{Adapting to Contextual Knowledge}$ (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, $\textbf{GENIUS}$ establishes a rigorous standard for $\textit{GFI}$, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: $\href{https://github.com/arctanxarc/GENIUS}{https://github.com/arctanxarc/GENIUS}$.

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. In this work, we introduce Normalizing-flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume-preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
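
For intuition, the sketch below shows a goal-conditioned RealNVP-style policy with tractable log-likelihood, the ingredient that NF-HIQL substitutes for Gaussian policies at both levels of the hierarchy. It is a simplified illustration under an assumed pooled (state, goal) conditioning vector, not the paper's implementation.

import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One RealNVP-style affine coupling layer conditioned on (state, goal)."""
    def __init__(self, act_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.d = act_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (act_dim - self.d)),
        )

    def forward(self, x, cond):
        # x -> z; the first half passes through unchanged, the second half is rescaled/shifted.
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)  # keep scales well-behaved
        z2 = (x2 - t) * torch.exp(-s)
        return torch.cat([x1, z2], dim=-1), -s.sum(dim=-1)  # (z, log|det J|)

    def inverse(self, z, cond):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(torch.cat([z1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)
        return torch.cat([z1, z2 * torch.exp(s) + t], dim=-1)

class FlowPolicy(nn.Module):
    """Goal-conditioned policy pi(a | s, g) with exact log-likelihood and fast sampling."""
    def __init__(self, act_dim: int, cond_dim: int, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [ConditionalAffineCoupling(act_dim, cond_dim) for _ in range(n_layers)]
        )
        self.act_dim = act_dim

    def log_prob(self, a, cond):
        z, log_det = a, torch.zeros(a.shape[0], device=a.device)
        for layer in self.layers:
            z, ld = layer(z, cond)
            log_det = log_det + ld
            z = z.flip(-1)  # cheap permutation so every action dim gets transformed
        base = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
        return base + log_det

    def sample(self, cond):
        z = torch.randn(cond.shape[0], self.act_dim, device=cond.device)
        for layer in reversed(self.layers):
            z = z.flip(-1)            # undo the permutation applied after this layer
            z = layer.inverse(z, cond)
        return z

In a hierarchical setup, one such policy would model subgoals given (state, goal) at the high level and actions given (state, subgoal) at the low level.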

Weight Decay Improves Language Model Plasticity

The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
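
In practice the lever is just the optimizer's weight decay setting. The sketch below shows a typical AdamW setup with decoupled weight decay and the common exclusion of biases and normalization parameters; the sweep values and stand-in model are illustrative, not the paper's configuration.

import torch
from torch.optim import AdamW

def make_optimizer(model: torch.nn.Module, weight_decay: float) -> AdamW:
    """Decoupled weight decay on matrix-shaped parameters only (common practice)."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # Exclude biases and norm scales (1-D tensors) from decay.
        (no_decay if p.ndim < 2 else decay).append(p)
    return AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=3e-4, betas=(0.9, 0.95),
    )

# Hypothetical pretraining sweep over weight decay; the tiny model is a stand-in.
optimizers = {wd: make_optimizer(torch.nn.Linear(16, 16), wd) for wd in (0.0, 0.1, 0.3)}

Under the paper's finding, the larger-decay runs may end pretraining with slightly worse loss yet fine-tune better downstream, so the sweep should be scored on post-fine-tuning metrics, not validation loss alone.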

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 satisfiability modulo theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
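
A toy example of the symbolic half of this pipeline: encode one atomic constraint and ask Z3 whether the agent's proposed action can violate it, so an unsat result is a proof of compliance rather than a score. The constraint and values below are invented for illustration and are not from the paper.

from z3 import Real, Solver, sat  # pip install z3-solver

# Atomic constraint distilled from natural-language intent (illustrative):
# "never transfer more than the approved budget".
approved_budget = Real("approved_budget")
transfer_amount = Real("transfer_amount")

solver = Solver()
solver.add(approved_budget == 500.0)       # context extracted from the conversation
solver.add(transfer_amount == 120.0)       # the agent's proposed action
solver.add(transfer_amount > approved_budget)  # ask whether a *violation* is satisfiable

if solver.check() == sat:
    print("Constraint can be violated:", solver.model())
else:
    print("unsat: the action provably satisfies the budget constraint")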

Learning to Compose for Cross-domain Agentic Workflow Generation

Automatically generating agentic workflows -- executable operator graphs or code that orchestrate reasoning, verification, and repair -- has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative workflow refinement to discover a feasible workflow from a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases to generate a task-specific workflow in a single pass. To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, thereby capturing which capabilities actually drive success by their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.
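
A minimal sketch of the recompose step, assuming a task embedding and a bank of learned capability bases (both hypothetical): the task is mapped to sparse top-k weights over the bases, and the resulting composition would condition single-pass workflow generation.

import torch
import torch.nn as nn

class SparseWorkflowComposer(nn.Module):
    """Illustrative: sparse mixture over K learned workflow-capability bases."""
    def __init__(self, task_dim: int, n_bases: int = 32, basis_dim: int = 64, top_k: int = 4):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(n_bases, basis_dim) * 0.02)
        self.scorer = nn.Linear(task_dim, n_bases)
        self.top_k = top_k

    def forward(self, task_emb: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(task_emb)                        # (B, n_bases)
        topk = scores.topk(self.top_k, dim=-1)
        # Keep only the top-k capabilities, renormalized to a sparse simplex.
        weights = torch.zeros_like(scores).scatter(
            -1, topk.indices, topk.values.softmax(dim=-1))
        return weights @ self.bases                           # (B, basis_dim) composed spec

The "decide" step would then ablate individual bases and measure the counterfactual change in task success to attribute credit to each capability.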

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times as many lines of code and file changes as prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
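
A schematic sketch of the satisficing control loop, with generate_step and safety_reward as hypothetical stand-ins for the step-wise decoder and the safety reward model; it is illustrative, not the paper's code.

# Illustrative sketch of SafeThink-style, inference-time safety steering.
CORRECTIVE_PREFIX = "Wait, think safely."
SAFETY_THRESHOLD = 0.5   # hypothetical threshold for the safety reward model
MAX_EARLY_STEPS = 3      # the paper finds the first 1-3 reasoning steps usually suffice

def safethink_decode(prompt, generate_step, safety_reward, max_steps=64):
    """Generate step by step; inject the corrective prefix only when the
    safety threshold is violated during the early reasoning steps."""
    trace = prompt
    for step in range(max_steps):
        candidate = generate_step(trace)  # next reasoning step as a string
        if step < MAX_EARLY_STEPS and safety_reward(trace + candidate) < SAFETY_THRESHOLD:
            candidate = CORRECTIVE_PREFIX + " " + candidate
        trace += candidate
        if candidate.strip().endswith("</answer>"):  # hypothetical stop condition
            break
    return trace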

Direct Learning of Calibration-Aware Uncertainty for Neural PDE Surrogates

Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. Cross-regularized uncertainty learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
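
A minimal sketch of the cross-regularized update, assuming a toy linear predictor and a single learned output-noise parameter: the predictor takes gradients from the training split, while the uncertainty parameter is updated only from the held-out regularization split.

import torch
import torch.nn as nn

predictor = nn.Linear(8, 1)                      # stand-in for the neural operator
log_sigma = nn.Parameter(torch.zeros(1))         # low-dimensional uncertainty control

opt_fit = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_unc = torch.optim.Adam([log_sigma], lr=1e-2)

def gaussian_nll(pred, target, log_sigma):
    return (0.5 * ((pred - target) / log_sigma.exp()) ** 2 + log_sigma).mean()

def train_step(x_tr, y_tr, x_reg, y_reg):
    # Fit step: update the predictor on the training split, uncertainty held fixed.
    opt_fit.zero_grad()
    gaussian_nll(predictor(x_tr), y_tr, log_sigma.detach()).backward()
    opt_fit.step()
    # Calibration step: update only the uncertainty on the held-out regularization split.
    opt_unc.zero_grad()
    with torch.no_grad():
        pred_reg = predictor(x_reg)
    gaussian_nll(pred_reg, y_reg, log_sigma).backward()
    opt_unc.step()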

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.

AI Models

AIDC-AI/Ovis2.6-30B-A3B


license: apache-2.0
pipeline_tag: image-text-to-text

Ovis2.6-30B-A3B

<div align="center"> <img src=https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png width="30%"/> </div>

Introduction

We introduce Ovis2.6-30B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.

<div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/IPsQk8gTTMD-ipTye3WED.png" width="100%" /> </div>

Key Features

  • MoE Architecture: Superior Performance with Low Serving Cost
    The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 30B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.

  • Enhanced Long-Sequence and High-Resolution Processing
    Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.

  • Think with Image
    We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.

  • Reinforced OCR, Document, and Chart Capabilities
    Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.

<div align="center"> <img src="https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/3_A0CA-oO0Ie_WoigjAwo.png" width="100%" /> </div>

Performance

(Benchmark comparison table omitted.)

Quick Inference

Below is a simple example demonstrating how to run Ovis2.6 with a single image input.

First, install the required dependencies:

pip install torch==2.7.1 transformers==4.57.0 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3 accelerate==1.12.0
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3

Then, run the following code.

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

# Thinking mode & budget
enable_thinking = True
enable_thinking_budget = True  # Only effective if enable_thinking is True.

# Total tokens for thinking + answer. Ensure: max_new_tokens > thinking_budget + 25
max_new_tokens = 2048
thinking_budget = 1024

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.6-30B-A3B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

The thinking and thinking budget logic can be applied in the same way for multi-image, video and pure text scenarios.

Note (answer extraction for CoT/Thinking): To make evaluation and usage easier, we recommend appending a fixed suffix to prompts when using chain-of-thought (CoT) or thinking mode. This ensures the model clearly outputs a final answer that can be extracted programmatically:

End your response with 'Final answer: '.

For example:

Calculate the sum of the numbers in the middle box in figure (c).
End your response with 'Final answer: '.
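
A small helper (not part of the model card) for pulling out the answer once that suffix is used, applied to the response string from the Quick Inference example above:

import re

def extract_final_answer(response: str):
    """Return the text after 'Final answer:' or None if the marker is absent."""
    match = re.search(r"Final answer:\s*(.+)", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_final_answer(response))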

Tip: The sections below include an optional streaming helper (compatible with two-phase thinking/budget runs) and extra inference modes: multi-image, video, and text-only.

<details> <summary>Optional: Streaming (Advanced)</summary>

To support the thinking budget, we modified the implementation of the Ovis generate method, so the default TextIteratorStreamer is no longer compatible. If you need to stream model output, be sure to use the helper class below.

# --- Budget-aware streamer helper ---
from transformers import TextIteratorStreamer

class BudgetAwareTextStreamer(TextIteratorStreamer):
    """A streamer compatible with Ovis two-phase generation.

    Call .manual_end() after generation to flush any remaining text.
    """
    def manual_end(self):
        if len(self.token_cache) > 0:
            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
            printable_text = text[self.print_len:]
            self.token_cache = []
            self.print_len = 0
        else:
            printable_text = ""
        self.next_tokens_are_prompt = True
        self.on_finalized_text(printable_text, stream_end=True)

    # Disable base class's end hook; we'll finalize via manual_end()
    def end(self):
        pass

Example usage:

streamer = BudgetAwareTextStreamer(
    model.text_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
    streamer=streamer
)

</details> <details> <summary>Example: Multi-image</summary> Demonstrates how to run inference with multiple images and a related question.
# Multi-image inference
multi_image_files = [
    "/path/to/image_1.jpg",
    "/path/to/image_2.jpg",
    "/path/to/image_3.jpg",
]

content = [{"type": "image", "image": Image.open(p).convert("RGB")} for p in multi_image_files]
content.append({"type": "text", "text": "Describe the images."})
messages = [{"role": "user", "content": content}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(messages=messages, add_generation_prompt=True, max_pixels=896*896)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda().to(model.dtype) if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, pixel_values=pixel_values, grid_thws=grid_thws,
                             max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
</details> <details> <summary>Example: Video</summary> Demonstrates how to run inference on a video by sampling multiple frames and asking the model to describe the content.
# Video inference
from moviepy.editor import VideoFileClip  # pip install moviepy==1.0.3

video_file = "/path/to/video_1.mp4"
num_frames = 8

with VideoFileClip(video_file) as clip:
    total_frames = int(clip.fps * clip.duration)
    indices = [int(i * total_frames / num_frames) for i in range(num_frames)]
    frames = [Image.fromarray(clip.get_frame(t)) for t in (idx / clip.fps for idx in indices)]

messages = [{"role": "user", "content": [
    {"type": "video", "video": frames},
    {"type": "text", "text": "Describe this video in detail."},
]}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(messages=messages, add_generation_prompt=True, max_pixels=896*896)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda().to(model.dtype) if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, pixel_values=pixel_values, grid_thws=grid_thws,
                             max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
</details> <details> <summary>Example: Text-only</summary> Demonstrates how to run inference using only text input without any images or videos.
# Text-only inference
messages = [{"role": "user", "content": "Hi, please introduce Yellow Mountain."}]

input_ids, _, _ = model.preprocess_inputs(messages=messages, add_generation_prompt=True)
input_ids = input_ids.cuda()

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
</details>

To enable grounding, end your prompt with Please provide the bounding box coordinates. (for boxes) or Please provide the point coordinates. (for points). To target a specific object, wrap its description in <ref> tags, e.g.:

Find the <ref>red apple</ref> in the image. Please provide the bounding box coordinates.

Coordinates are normalized to [0,1) with the origin (0,0) at the top-left corner of the image.

  • Point: <point>(x,y)</point>
  • Bounding box: <box>(x1,y1),(x2,y2)</box> where (x1,y1) is top-left, (x2,y2) is bottom-right.
  • Multiple results can be listed in square brackets: [<box>(...)</box>,<box>(...)</box> ]

Example:

The image features a serene scene with <ref>three birds</ref>[
  <box>(0.401,0.526),(0.430,0.557)</box>,
  <box>(0.489,0.494),(0.516,0.526)</box>,
  <box>(0.296,0.529),(0.324,0.576)</box>
] flying in formation against a clear blue sky.
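
A small helper (not from the official repo) that converts these normalized box strings into pixel coordinates, e.g. for drawing:

import re
from PIL import Image

# Matches <box>(x1,y1),(x2,y2)</box> with normalized coordinates in [0,1).
BOX_PATTERN = re.compile(r"<box>\(([\d.]+),([\d.]+)\),\(([\d.]+),([\d.]+)\)</box>")

def parse_boxes(text: str, image: Image.Image):
    """Return a list of (x1, y1, x2, y2) boxes in pixel coordinates."""
    w, h = image.size
    return [
        (x1 * w, y1 * h, x2 * w, y2 * h)
        for x1, y1, x2, y2 in (map(float, m) for m in BOX_PATTERN.findall(text))
    ]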

Citation

If you find Ovis useful, please consider citing the following papers:

@article{lu2025ovis25technicalreport,
  title={Ovis2.5 Technical Report}, 
  author={Shiyin Lu and Yang Li and Yu Xia and Yuwei Hu and Shanshan Zhao and Yanqing Ma and Zhichao Wei and Yinglun Li and Lunhao Duan and Jianshan Zhao and Yuxuan Han and Haijun Li and Wanying Chen and Junke Tang and Chengkun Hou and Zhixing Du and Tianli Zhou and Wenjie Zhang and Huping Ding and Jiahe Li and Wen Li and Gui Hu and Yiliang Gu and Siran Yang and Jiamang Wang and Hailong Sun and Yibo Wang and Hui Sun and Jinlong Huang and Yuping He and Shengze Shi and Weihong Zhang and Guodong Zheng and Junpeng Jiang and Sensen Gao and Yi-Feng Wu and Sijia Chen and Yuhui Chen and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang},
  year={2025},
  journal={arXiv:2508.11737}
}

@article{lu2024ovis,
  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
  year={2024},
  journal={arXiv:2405.20797}
}

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during the training process, to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

Author: AIDC-AI

Likes: 51

Downloads: 0

Tags: safetensors, ovis2_6_moe, image-text-to-text, conversational, custom_code, arxiv:2508.11737, arxiv:2405.20797, license:apache-2.0, region:us

lm-provers/QED-Nano


library_name: transformers
license: apache-2.0
language:
  • en
base_model:
  • Qwen/Qwen3-4B-Thinking-2507

QED-Nano

logo.png

Table of Contents

  1. Model Summary
  2. How to use
  3. Evaluation
  4. Limitations
  5. License

Model Summary

QED-Nano is a 4B parameter model explicitly post-trained to strengthen its proof-writing capabilities. Despite its small size, QED-Nano achieves an impressive 40% score on the challenging IMO-ProofBench benchmark (+20% over the Qwen3 base model), matching the performance of GPT-OSS-120B from OpenAI. With an agent scaffold that scales inference-time compute to over 1M tokens per problem, QED-Nano approaches the performance of Gemini-3-Pro. Crucially, the same agentic scaffold on the base model (Qwen3-4B-Thinking-2507) barely improves performance.

imoproofbench.png

QED-Nano is based on Qwen/Qwen3-4B-Thinking-2507, and was post-trained via a combination of supervised fine-tuning and reinforcement learning with a reasoning cache (to be able to train for continual improvement with our agentic scaffold at test time) on a mixture of Olympiad proof problems from various public sources.

[!NOTE] We are working to release the full training recipe, including data, code, and agent scaffolds -- stay tuned!

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lm-provers/QED-Nano"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

# prepare the model input
prompt = r"Generate a rigorous proof to the following question: is \sqrt{2} rational or irrational?"  # raw string so \s is kept literally
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

[!TIP] We recommend setting temperature=0.6 and top_p=0.95 in the sampling parameters.

vLLM and SGLang

You can use vLLM and SGLang to deploy the model in an API compatible with OpenAI format.

SGLang

python -m sglang.launch_server --model-path lm-provers/QED-Nano

vLLM

vllm serve lm-provers/QED-Nano
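
Once either server is running, requests go through its OpenAI-compatible endpoint; a minimal client sketch, assuming the default vLLM port 8000 (SGLang defaults to 30000) and the recommended sampling settings:

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="lm-provers/QED-Nano",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
    temperature=0.6,   # recommended sampling parameters (see tip above)
    top_p=0.95,
    max_tokens=32768,
)
print(completion.choices[0].message.content)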

Evaluation

In this section, we report the evaluation results of QED-Nano on IMO-ProofBench and ProofBench. All evaluations are reported as avg@3 unless stated otherwise.

evals.png

Limitations

QED-Nano is a domain-specific model that is designed for one thing and one thing only: proving theorems. Using it as a general assistant will likely produce nonsense outside of this domain. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

License

Apache 2.0

Acknowledgements

QED-Nano is a joint collaboration between the research teams at CMU, ETH Zurich, Numina, and Hugging Face. Below is a list of the individual contributors and their affiliations:

CMU

Amrith Setlur, Yuxiao Qu, Ian Wu, and Aviral Kumar

ETH Zurich

Jasper Dekoninck

Numina

Jia Li

Hugging Face

Edward Beeching and Lewis Tunstall

Author: lm-provers

Likes: 10

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, conversational, en, arxiv:2602.03773, base_model:Qwen/Qwen3-4B-Thinking-2507, base_model:finetune:Qwen/Qwen3-4B-Thinking-2507, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

unsloth/GLM-5-FP8


language:
  • en
  • zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
  • zai-org/GLM-5
tags:
  • unsloth
  • glm_moe_dsa

<div> <p style="margin-bottom: 0; margin-top: 0;"> <h1 style="margin-top: 0rem;">See how to run GLM-5 locally - <a href="https://docs.unsloth.ai/models/glm-5">Read our Guide!</a></h1> </p> <p style="margin-top: 0;margin-bottom: 0;"> <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em> </p> <div style="margin-top: 0;display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/models/glm-5"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> </div>

To run, you must install llama.cpp PR 19460. You can follow the instructions in our guide here.


GLM-5

<div align="center"> <img src=https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg width="15%"/> </div> <p align="center"> 👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community. <br> 📖 Check out the GLM-5 <a href="https://z.ai/blog/glm-5" target="_blank">technical blog</a>. <br> 📍 Use GLM-5 API services on <a href="https://docs.z.ai/guides/llm/glm-5">Z.ai API Platform. </a> <br> 👉 One click to <a href="https://chat.z.ai">GLM-5</a>. </p>

Introduction

We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is challenging due to RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvements over GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

Benchmark

| | GLM-5 | GLM-4.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (xhigh) |
| -------------------------------- | ------------- | --------- | ------------- | --------- | --------------- | ------------ | --------------- |
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |

*: refers to their scores of full set.

†: A verified version of Terminal-Bench 2.0 that fixes some ambiguous instructions. See footnote for more evaluation details.

Footnote

  • Humanity’s Last Exam (HLE) & other reasoning tasks: We evaluate with a maximum generation length of 131,072 tokens (temperature=1.0, top_p=0.95, max_new_tokens=131072). By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
  • SWE-bench & SWE-bench Multilingual: We run the SWE-bench suite with OpenHands using a tailored instruction prompt. Settings: temperature=0.7, top_p=0.95, max_new_tokens=16384, with a 200K context window.
  • BrowseComp: Without context management, we retain details from the most recent 5 turns. With context management, we use the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5.
  • Terminal-Bench 2.0 (Terminus 2): We evaluate with the Terminus framework using timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
  • Terminal-Bench 2.0 (Claude Code): We evaluate in Claude Code 2.1.14 (think mode, default effort) with temperature=1.0, top_p=0.95, max_new_tokens=65536. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: https://huggingface.co/datasets/zai-org/terminal-bench-2-verified).
  • CyberGym: We evaluate in Claude Code 2.1.18 (think mode, no web tools) with (temperature=1.0, top_p=1.0, max_new_tokens=32000) and a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
  • MCP-Atlas: All models are evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini 3 Pro as the judge model.
  • τ²-bench: We add a small prompt adjustment in Retail and Telecom to avoid failures caused by premature user termination. For Airline, we apply the domain fixes proposed in the Claude Opus 4.5 system card.
  • Vending Bench 2: Runs are conducted independently by Andon Labs.

Serve GLM-5 Locally

Prepare environment

vLLM, SGLang, and xLLM all support local deployment of GLM-5. A simple deployment guide is provided here.

  • vLLM

    Using Docker as:

    docker pull vllm/vllm-openai:nightly 
    

    or using pip:

    pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
    

    then upgrade transformers:

    pip install git+https://github.com/huggingface/transformers.git
    
  • SGLang

    Using Docker as:

    docker pull lmsysorg/sglang:glm5-hopper # For Hopper GPU
    docker pull lmsysorg/sglang:glm5-blackwell # For Blackwell GPU
    

Deploy

  • vLLM

    vllm serve zai-org/GLM-5-FP8 \
         --tensor-parallel-size 8 \
         --gpu-memory-utilization 0.85 \
         --speculative-config.method mtp \
         --speculative-config.num_speculative_tokens 1 \
         --tool-call-parser glm47 \
         --reasoning-parser glm45 \
         --enable-auto-tool-choice \
         --served-model-name glm-5-fp8
    

    Check the recipes for more details.

  • SGLang

    python3 -m sglang.launch_server \
      --model-path zai-org/GLM-5-FP8 \
      --tp-size 8 \
      --tool-call-parser glm47  \
      --reasoning-parser glm45 \
      --speculative-algorithm EAGLE \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4 \
      --mem-fraction-static 0.85 \
      --served-model-name glm-5-fp8
    

    Check the sglang cookbook for more details. A minimal Python client sketch for querying the served endpoint follows this list.

  • xLLM and other Ascend NPU

    Please check the deployment guide here.
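
As noted above, once any of these servers is up it exposes an OpenAI-compatible API under the served model name glm-5-fp8; a minimal Python client sketch, assuming a local server on port 8000 (adjust the port for your deployment):

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="glm-5-fp8",  # must match --served-model-name used at launch
    messages=[{"role": "user", "content": "Summarize the trade-offs of sparse attention."}],
)
print(response.choices[0].message.content)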

Citation

Our technical report is coming soon.

Author: unsloth

Likes: 5

Downloads: 0

Tags: transformers, safetensors, glm_moe_dsa, text-generation, unsloth, conversational, en, zh, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, license:mit, endpoints_compatible, fp8, region:us

FireRedTeam/FireRedASR2-AED


license: apache-2.0
language:
  • en
  • zh
tags:
  • audio
  • automatic-speech-recognition
  • asr

<div align="center"> <h1> FireRedASR2S <br> A SOTA Industrial-Grade All-in-One ASR System </h1> </div>

[Paper] [Model] [Blog] [Demo]

FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:

  • FireRedASR2: Automatic Speech Recognition (ASR) supporting Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and singing lyrics recognition. 2.89% average CER on Mandarin (4 test sets), 11.55% on Chinese dialects (19 test sets), outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512. FireRedASR2-AED also supports word-level timestamps and confidence scores.
  • FireRedVAD: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, outperforming Silero-VAD, TEN-VAD, and FunASR-VAD. Supports non-streaming/streaming VAD and Audio Event Detection.
  • FireRedLID: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, outperforming Whisper and SpeechBrain-LID.
  • FireRedPunc: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).

2S: 2nd-generation FireRedASR, now expanded to an all-in-one ASR System

🔥 News

  • [2026.02.12] We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with model weights and inference code. Download links below. Technical report and finetuning code coming soon.

Available Models and Languages

|Model|Supported Languages & Dialects|Download|
|:-------------:|:---------------------------------:|:----------:|
|FireRedASR2| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | 🤗 \| 🤖 |
|FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | 🤗 \| 🤖 |
|FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | 🤗 \| 🤖 |
|FireRedPunc| Chinese, English | 🤗 \| 🤖 |

<sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.

Method

FireRedASR2

FireRedASR2 builds upon FireRedASR with improved accuracy, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. It comprises two variants:

  • FireRedASR2-LLM: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
  • FireRedASR2-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.

Other Modules

  • FireRedVAD: DFSMN-based non-streaming/streaming Voice Activity Detection and Audio Event Detection.
  • FireRedLID: FireRedASR2-based Spoken Language Identification. See FireRedLID README for language details.
  • FireRedPunc: BERT-based Punctuation Prediction.

Evaluation

FireRedASR2

Metrics: Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English. Lower is better.
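
For reference, a minimal edit-distance sketch of how CER is computed (illustrative, not the official scoring script); WER is the same computation over word lists:

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with a single rolling row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(f"{100 * cer('你好世界', '你好世届'):.2f}%")  # 25.00% (one substitution over four characters)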

We evaluate FireRedASR2 on 24 public test sets covering Mandarin, 20+ Chinese dialects/accents, and singing.

  • Mandarin (4 test sets): 2.89% (LLM) / 3.05% (AED) average CER, outperforming Doubao-ASR (3.69%), Qwen3-ASR-1.7B (3.76%), Fun-ASR (4.16%) and Fun-ASR-Nano-2512 (4.55%).
  • Dialects (19 test sets): 11.55% (LLM) / 11.67% (AED) average CER, outperforming Doubao-ASR (15.39%), Qwen3-ASR-1.7B (11.85%), Fun-ASR (12.76%) and Fun-ASR-Nano-2512 (15.07%).

Note: ws=WenetSpeech, md=MagicData, conv=Conversational, daily=Daily-use.

|ID|Testset\Model|FireRedASR2-LLM|FireRedASR2-AED|Doubao-ASR|Qwen3-ASR|Fun-ASR|Fun-ASR-Nano|
|:--:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| |Average CER<br>(All, 1-24) |9.67 |9.80 |12.98 |10.12 |10.92 |12.81 |
| |Average CER<br>(Mandarin, 1-4) |2.89 |3.05 |3.69 |3.76 |4.16 |4.55 |
| |Average CER<br>(Dialects, 5-23) |11.55|11.67|15.39|11.85|12.76|15.07|
|1 |aishell1 |0.64 |0.57 |1.52 |1.48 |1.64 |1.96 |
|2 |aishell2 |2.15 |2.51 |2.77 |2.71 |2.38 |3.02 |
|3 |ws-net |4.44 |4.57 |5.73 |4.97 |6.85 |6.93 |
|4 |ws-meeting |4.32 |4.53 |4.74 |5.88 |5.78 |6.29 |
|5 |kespeech |3.08 |3.60 |5.38 |5.10 |5.36 |7.66 |
|6 |ws-yue-short |5.14 |5.15 |10.51|5.82 |7.34 |8.82 |
|7 |ws-yue-long |8.71 |8.54 |11.39|8.85 |10.14|11.36|
|8 |ws-chuan-easy |10.90|10.60|11.33|11.99|12.46|14.05|
|9 |ws-chuan-hard |20.71|21.35|20.77|21.63|22.49|25.32|
|10|md-heavy |7.42 |7.43 |7.69 |8.02 |9.13 |9.97 |
|11|md-yue-conv |12.23|11.66|26.25|9.76 |33.71|15.68|
|12|md-yue-daily |3.61 |3.35 |12.82|3.66 |2.69 |5.67 |
|13|md-yue-vehicle |4.50 |4.83 |8.66 |4.28 |6.00 |7.04 |
|14|md-chuan-conv |13.18|13.07|11.77|14.35|14.01|17.11|
|15|md-chuan-daily |4.90 |5.17 |3.90 |4.93 |3.98 |5.95 |
|16|md-shanghai-conv |28.70|27.02|45.15|29.77|25.49|37.08|
|17|md-shanghai-daily |24.94|24.18|44.06|23.93|12.55|28.77|
|18|md-wu |7.15 |7.14 |7.70 |7.57 |10.63|10.56|
|19|md-zhengzhou-conv |10.20|10.65|9.83 |9.55 |10.85|13.09|
|20|md-zhengzhou-daily|5.80 |6.26 |5.77 |5.88 |6.29 |8.18 |
|21|md-wuhan |9.60 |10.81|9.94 |10.22|4.34 |8.70 |
|22|md-tianjin |15.45|15.30|15.79|16.16|19.27|22.03|
|23|md-changsha |23.18|25.64|23.76|23.70|25.66|29.23|
|24|opencpop |1.12 |1.17 |4.36 |2.57 |3.05 |2.95 |

Doubao-ASR (volc.seedasr.auc) was tested in early February 2026, and Fun-ASR in late November 2025. Our ASR training data does not include any Chinese dialect or accented speech data from MagicData.

  • Doubao-ASR (API): https://www.volcengine.com/docs/6561/1354868
  • Qwen3-ASR (1.7B): https://github.com/QwenLM/Qwen3-ASR
  • Fun-ASR (API): https://help.aliyun.com/zh/model-studio/recording-file-recognition
  • Fun-ASR-Nano-2512: https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512

FireRedVAD

We evaluate FireRedVAD on FLEURS-VAD-102, a multilingual VAD benchmark covering 102 languages.

FireRedVAD achieves SOTA performance, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

|Metric\Model|FireRedVAD|Silero-VAD|TEN-VAD|FunASR-VAD|WebRTC-VAD|
|:-------:|:-----:|:------:|:------:|:------:|:------:|
|AUC-ROC↑ |99.60|97.99|97.81|- |- |
|F1 score↑ |97.57|95.95|95.19|90.91|52.30|
|False Alarm Rate↓ |2.69 |9.41 |15.47|44.03|2.83 |
|Miss Rate↓|3.62 |3.95 |2.95 |0.42 |64.15|

<sup>*</sup>FLEURS-VAD-102: We randomly selected ~100 audio files per language from FLEURS test set, resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).

Note: FunASR-VAD achieves low Miss Rate but at the cost of high False Alarm Rate (44.03%), indicating over-prediction of speech segments.

FireRedLID

Metric: Utterance-level LID Accuracy (%). Higher is better.

We evaluate FireRedLID on multilingual and Chinese dialect benchmarks.

FireRedLID achieves SOTA performance, outperforming Whisper, SpeechBrain-LID, and Dolphin.

|Testset\Model|Languages|FireRedLID|Whisper|SpeechBrain|Dolphin|
|:-----------------:|:---------:|:---------:|:-----:|:---------:|:-----:|
|FLEURS test |82 languages |97.18 |79.41 |92.91 |-|
|CommonVoice test |74 languages |92.07 |80.81 |78.75 |-|
|KeSpeech + MagicData|20+ Chinese dialects/accents |88.47 |-|-|69.01|

FireRedPunc

Metric: Precision/Recall/F1 Score (%). Higher is better.

We evaluate FireRedPunc on multi-domain Chinese and English benchmarks.

FireRedPunc achieves SOTA performance, outperforming FunASR-Punc (CT-Transformer).

|Testset\Model|#Sentences|FireRedPunc|FunASR-Punc|
|:------------------:|:---------:|:--------------:|:-----------------:|
|Multi-domain Chinese| 88,644 |82.84 / 83.08 / 82.96 | 77.27 / 74.03 / 75.62 |
|Multi-domain English| 28,641 |78.40 / 71.57 / 74.83 | 55.79 / 45.15 / 49.91 |
|Average F1 Score | - |78.90 | 62.77 |

Quick Start

Setup

  1. Create a clean Python environment:
$ conda create --name fireredasr2s python=3.10
$ conda activate fireredasr2s
$ git clone https://github.com/FireRedTeam/FireRedASR2S.git
$ cd FireRedASR2S  # or fireredasr2s
  2. Install dependencies and set up PATH and PYTHONPATH:
$ pip install -r requirements.txt
$ export PATH=$PWD/fireredasr2s/:$PATH
$ export PYTHONPATH=$PWD/:$PYTHONPATH
  3. Download models:
# Download via ModelScope (recommended for users in China)
pip install -U modelscope
modelscope download --model FireRedTeam/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
modelscope download --model FireRedTeam/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
modelscope download --model FireRedTeam/FireRedLID --local_dir ./pretrained_models/FireRedLID
modelscope download --model FireRedTeam/FireRedPunc --local_dir ./pretrained_models/FireRedPunc

# Download via Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_models/FireRedASR2-AED
huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
  4. Convert your audio to 16kHz 16-bit mono PCM format if needed:
$ ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>

Script Usage

$ cd examples_infer/asr_system
$ bash inference_asr_system.sh

Command-line Usage

$ fireredasr2s-cli --help
$ fireredasr2s-cli --wav_paths "assets/hello_zh.wav" "assets/hello_en.wav" --outdir output
$ cat output/result.jsonl 
# {"uttid": "hello_zh", "text": "你好世界。", "sentences": [{"start_ms": 310, "end_ms": 1840, "text": "你好世界。", "asr_confidence": 0.875, "lang": "zh mandarin", "lang_confidence": 0.999}], "vad_segments_ms": [[310, 1840]], "dur_s": 2.32, "words": [{"start_ms": 490, "end_ms": 690, "text": "你"}, {"start_ms": 690, "end_ms": 1090, "text": "好"}, {"start_ms": 1090, "end_ms": 1330, "text": "世"}, {"start_ms": 1330, "end_ms": 1795, "text": "界"}], "wav_path": "assets/hello_zh.wav"}
# {"uttid": "hello_en", "text": "Hello speech.", "sentences": [{"start_ms": 120, "end_ms": 1840, "text": "Hello speech.", "asr_confidence": 0.833, "lang": "en", "lang_confidence": 0.998}], "vad_segments_ms": [[120, 1840]], "dur_s": 2.24, "words": [{"start_ms": 340, "end_ms": 1020, "text": "hello"}, {"start_ms": 1020, "end_ms": 1666, "text": "speech"}], "wav_path": "assets/hello_en.wav"}

Python API Usage

from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

asr_system_config = FireRedAsr2SystemConfig()  # Use default config
asr_system = FireRedAsr2System(asr_system_config)

result = asr_system.process("assets/hello_zh.wav")
print(result)
# {'uttid': 'tmpid', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [], 'wav_path': 'assets/hello_zh.wav'}

result = asr_system.process("assets/hello_en.wav")
print(result)
# {'uttid': 'tmpid', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [], 'wav_path': 'assets/hello_en.wav'}

Usage of Each Module

The four components under fireredasr2s, i.e., fireredasr2, fireredvad, fireredlid, and fireredpunc, are self-contained and designed to work as standalone modules. You can use any of them independently without depending on the others. FireRedVAD and FireRedLID will also be open-sourced as standalone libraries in separate repositories.

Script Usage

# ASR
$ cd examples_infer/asr
$ bash inference_asr_aed.sh
$ bash inference_asr_llm.sh

# VAD & AED (Audio Event Detection)
$ cd examples_infer/vad
$ bash inference_vad.sh
$ bash inference_streamvad.sh
$ bash inference_aed.sh

# LID
$ cd examples_infer/lid
$ bash inference_lid.sh

# Punc
$ cd examples_infer/punc
$ bash inference_punc.sh

Python API Usage

Set up PYTHONPATH first: export PYTHONPATH=$PWD/:$PYTHONPATH

ASR

from fireredasr2s.fireredasr2 import FireRedAsr2, FireRedAsr2Config

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

# FireRedASR2-AED
asr_config = FireRedAsr2Config(
    use_gpu=True,
    use_half=False,
    beam_size=3,
    nbest=1,
    decode_max_len=0,
    softmax_smoothing=1.25,
    aed_length_penalty=0.6,
    eos_penalty=1.0,
    return_timestamp=True
)
model = FireRedAsr2.from_pretrained("aed", "pretrained_models/FireRedASR2-AED", asr_config)
results = model.transcribe(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'text': '你好世界', 'confidence': 0.971, 'dur_s': 2.32, 'rtf': '0.0870', 'wav': 'assets/hello_zh.wav', 'timestamp': [('你', 0.42, 0.66), ('好', 0.66, 1.1), ('世', 1.1, 1.34), ('界', 1.34, 2.039)]}, {'uttid': 'hello_en', 'text': 'hello speech', 'confidence': 0.943, 'dur_s': 2.24, 'rtf': '0.0870', 'wav': 'assets/hello_en.wav', 'timestamp': [('hello', 0.34, 0.98), ('speech', 0.98, 1.766)]}]

# FireRedASR2-LLM
asr_config = FireRedAsr2Config(
    use_gpu=True,
    decode_min_len=0,
    repetition_penalty=1.0,
    llm_length_penalty=0.0,
    temperature=1.0
)
model = FireRedAsr2.from_pretrained("llm", "pretrained_models/FireRedASR2-LLM", asr_config)
results = model.transcribe(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'text': '你好世界', 'rtf': '0.0681', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'text': 'hello speech', 'rtf': '0.0681', 'wav': 'assets/hello_en.wav'}]

VAD

from fireredasr2s.fireredvad import FireRedVad, FireRedVadConfig

vad_config = FireRedVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    min_speech_frame=20,
    max_speech_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000)
vad = FireRedVad.from_pretrained("pretrained_models/FireRedVAD/VAD", vad_config)

result, probs = vad.detect("assets/hello_zh.wav")

print(result)
# {'dur': 2.32, 'timestamps': [(0.44, 1.82)], 'wav_path': 'assets/hello_zh.wav'}

Stream VAD

<details> <summary>Click to expand</summary>
from fireredasr2s.fireredvad import FireRedStreamVad, FireRedStreamVadConfig

vad_config=FireRedStreamVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    pad_start_frame=5,
    min_speech_frame=8,
    max_speech_frame=2000,
    min_silence_frame=20,
    chunk_max_frame=30000)
stream_vad = FireRedStreamVad.from_pretrained("pretrained_models/FireRedVAD/Stream-VAD", vad_config)

frame_results, result = stream_vad.detect_full("assets/hello_zh.wav")

print(result)
# {'dur': 2.32, 'timestamps': [(0.46, 1.84)], 'wav_path': 'assets/hello_zh.wav'}
</details>

Audio Event Detection (AED)

<details> <summary>Click to expand</summary>
from fireredasr2s.fireredvad import FireRedAed, FireRedAedConfig

aed_config=FireRedAedConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    singing_threshold=0.5,
    music_threshold=0.5,
    min_event_frame=20,
    max_event_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000)
aed = FireRedAed.from_pretrained("pretrained_models/FireRedVAD/AED", aed_config)

result, probs = aed.detect("assets/event.wav")

print(result)
# {'dur': 22.016, 'event2timestamps': {'speech': [(0.4, 3.56), (3.66, 9.08), (9.27, 9.77), (10.78, 21.76)], 'singing': [(1.79, 19.96), (19.97, 22.016)], 'music': [(0.09, 12.32), (12.33, 22.016)]}, 'event2ratio': {'speech': 0.848, 'singing': 0.905, 'music': 0.991}, 'wav_path': 'assets/event.wav'}
</details>

LID

<details> <summary>Click to expand</summary>
from fireredasr2s.fireredlid import FireRedLid, FireRedLidConfig

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

config = FireRedLidConfig(use_gpu=True, use_half=False)
model = FireRedLid.from_pretrained("pretrained_models/FireRedLID", config)

results = model.process(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'lang': 'zh mandarin', 'confidence': 0.996, 'dur_s': 2.32, 'rtf': '0.0741', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'lang': 'en', 'confidence': 0.996, 'dur_s': 2.24, 'rtf': '0.0741', 'wav': 'assets/hello_en.wav'}]
</details>

Punc

<details> <summary>Click to expand</summary>
from fireredasr2s.fireredpunc.punc import FireRedPunc, FireRedPuncConfig

config = FireRedPuncConfig(use_gpu=True)
model = FireRedPunc.from_pretrained("pretrained_models/FireRedPunc", config)

batch_text = ["你好世界", "Hello world"]
results = model.process(batch_text)

print(results)
# [{'punc_text': '你好世界。', 'origin_text': '你好世界'}, {'punc_text': 'Hello world!', 'origin_text': 'Hello world'}]
</details>

ASR System

from fireredasr2s.fireredasr2 import FireRedAsr2Config
from fireredasr2s.fireredlid import FireRedLidConfig
from fireredasr2s.fireredpunc import FireRedPuncConfig
from fireredasr2s.fireredvad import FireRedVadConfig
from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

vad_config = FireRedVadConfig(
    use_gpu=False,
    smooth_window_size=5,
    speech_threshold=0.4,
    min_speech_frame=20,
    max_speech_frame=2000,
    min_silence_frame=20,
    merge_silence_frame=0,
    extend_speech_frame=0,
    chunk_max_frame=30000
)
lid_config = FireRedLidConfig(use_gpu=True, use_half=False)
asr_config = FireRedAsr2Config(
    use_gpu=True,
    use_half=False,
    beam_size=3,
    nbest=1,
    decode_max_len=0,
    softmax_smoothing=1.25,
    aed_length_penalty=0.6,
    eos_penalty=1.0,
    return_timestamp=True
)
punc_config = FireRedPuncConfig(use_gpu=True)

asr_system_config = FireRedAsr2SystemConfig(
    "pretrained_models/FireRedVAD/VAD",
    "pretrained_models/FireRedLID",
    "aed", "pretrained_models/FireRedASR2-AED",
    "pretrained_models/FireRedPunc",
    vad_config, lid_config, asr_config, punc_config,
    enable_vad=1, enable_lid=1, enable_punc=1
)
asr_system = FireRedAsr2System(asr_system_config)

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]
for wav_path, uttid in zip(batch_wav_path, batch_uttid):
    result = asr_system.process(wav_path, uttid)
    print(result)
# {'uttid': 'hello_zh', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [{'start_ms': 540, 'end_ms': 700, 'text': '你'}, {'start_ms': 700, 'end_ms': 1100, 'text': '好'}, {'start_ms': 1100, 'end_ms': 1300, 'text': '世'}, {'start_ms': 1300, 'end_ms': 1765, 'text': '界'}], 'wav_path': 'assets/hello_zh.wav'}
# {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}

FAQ

Q: What audio format is supported?

16kHz 16-bit mono PCM wav. Use ffmpeg to convert other formats: ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>
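
For batch conversion, a small helper can wrap that same ffmpeg invocation; the sketch below is not part of the toolkit, assumes ffmpeg is on PATH, and uses placeholder raw_audio/wav16k directory names.

import pathlib
import subprocess

def convert_to_16k_mono(src: str, dst: str) -> None:
    """Convert one file to 16kHz 16-bit mono PCM wav using the command above."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-acodec", "pcm_s16le", "-f", "wav", dst],
        check=True,
    )

out_dir = pathlib.Path("wav16k")                      # placeholder output directory
out_dir.mkdir(exist_ok=True)
for src in pathlib.Path("raw_audio").glob("*.mp3"):   # placeholder input directory
    convert_to_16k_mono(str(src), str(out_dir / (src.stem + ".wav")))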

Q: What are the input length limitations of ASR models?

  • FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
  • FireRedASR2-LLM supports audio input up to 30s. The behavior for longer input is untested.
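
To respect these limits before calling the AED or LLM model directly, a quick duration check can help; the snippet below is a minimal sketch (not part of the toolkit) that assumes 16kHz 16-bit mono PCM wav input as described above. Recordings over the limits are better routed through FireRedAsr2System, whose VAD front end segments the audio first.

import wave

def wav_duration_s(path: str) -> float:
    # Duration in seconds of a PCM wav file
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

for path in ["assets/hello_zh.wav"]:   # replace with your own wav files
    dur = wav_duration_s(path)
    if dur > 60:
        print(f"{path}: {dur:.1f}s exceeds the 60s AED limit; use FireRedAsr2System instead")
    elif dur > 30:
        print(f"{path}: {dur:.1f}s exceeds the 30s LLM limit; prefer the AED model or the full system")
    else:
        print(f"{path}: {dur:.1f}s is within both limits")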

Acknowledgements

Thanks to the following open-source works:

Author: FireRedTeam

Likes: 5

Downloads: 0

Tags: audio, automatic-speech-recognition, asr, en, zh, arxiv:2501.14350, license:apache-2.0, region:us

INC4AI/GLM-5-int4-mixed-AutoRound


base_model:

  • zai-org/GLM-5

pipeline_tag: text-generation

Model Details

This is a mixed int4 quantization of zai-org/GLM-5 (group_size 128, symmetric) generated by intel/auto-round. Please follow the license of the original model.

Generate the Model

auto-round --model_name zai-org/GLM-5 --scheme w4a16_mixed --iters 0 --output_dir GLM-5-int4-mixed-AutoRound
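
For inference, the quantized checkpoint should load through the standard transformers API; the snippet below is a hedged sketch (assuming this repo id, trust_remote_code for the GLM architecture, and a recent transformers install with auto-round support) rather than an official example from the card.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INC4AI/GLM-5-int4-mixed-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",    # keep the checkpoint's dtype handling for the quantized weights
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize AutoRound in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))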

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs. Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Here are a couple of useful links to learn more about Intel's AI software:

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}


Author: INC4AI

Likes: 3

Downloads: 0

Tags: safetensors, glm_moe_dsa, text-generation, conversational, arxiv:2309.05516, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, 4-bit, auto-round, region:us

EldritchLabs/Cthulhu-8B-v1.4


license: apache-2.0

base_model:

  • SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated

tags:

  • finetune
  • llama
  • cthulhu
  • lovecraft
  • goetia
  • qliphoth
  • PMPF
  • horror
  • creative writing
  • RP

datasets:

  • EldritchLabs/Cthulhu_v1.4b

language:

  • en

library_name: transformers

widget:

  • text: "Cthulhu 8B v1.4"
    output:
      url: https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/T3PPSucdpVr3x5HfS7HNc.png

[!CAUTION] ⚠️ Warning: This model can produce narratives and RP that contain violent and graphic erotic content. Adjust your system prompt accordingly, and use the Llama 3 chat template.

Cthulhu 8B v1.4

A fully uncensored finetune of Llama-3.1-Nemotron-8B, trained on a small dataset of Cthulhu/Goetia lore and cooked for 6 epochs using PMPF.

{
  "entropy": 0.12077461183071136,
  "epoch": 6.0,
  "grad_norm": 3.7384884357452393,
  "learning_rate": 3.685846260578524e-05,
  "loss": 0.0966,
  "mean_token_accuracy": 0.9680581092834473,
  "num_tokens": 164118.0,
  "step": 234
}

Recommended Settings: Temp 1.0, TopNSigma 1.25

Uses the Llama 3 chat template. Appears to be much smarter than Cthulhu 7B v1.4.
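
As a rough illustration of those recommended settings, below is a minimal transformers sketch (not from the model card) that applies the Llama 3 chat template and samples at temperature 1.0; top-n-sigma is not a stock transformers sampling option, so it is omitted, and the system prompt is only a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EldritchLabs/Cthulhu-8B-v1.4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a narrator of cosmic horror."},  # placeholder system prompt
    {"role": "user", "content": "Tell me a story."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))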


<details> <summary>Secret Sauce Settings (3060 ti)</summary>
MAX_SEQ_LENGTH = 768
LORA_R = 16
LORA_ALPHA = 32
NUM_EPOCHS = 6
LEARNING_RATE = 1e-4
optim="paged_adamw_8bit",
max_grad_norm=0.3,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
lora_dropout=0.05,
target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
# --- SAVE STRATEGY PATCH ---
        ### save_strategy="steps",        # Set to steps for large datasets
        ### save_steps=100,                # Save every 100 steps
        save_strategy="epoch",    # ### HOTSWAP: Uncomment for small datasets
</details>
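
For readers who want to see where those settings would plug in, the following is a minimal peft + transformers sketch; it is an assumption-based reconstruction (dataset loading, tokenization, and the trainer call are omitted, and the output directory name is made up), not the author's actual training script.

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings taken from the values above
lora_config = LoraConfig(
    r=16,                      # LORA_R
    lora_alpha=32,             # LORA_ALPHA
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings taken from the values above
training_args = TrainingArguments(
    output_dir="cthulhu-8b-lora",   # placeholder output directory
    num_train_epochs=6,             # NUM_EPOCHS
    learning_rate=1e-4,             # LEARNING_RATE
    optim="paged_adamw_8bit",
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="epoch",          # or "steps" + save_steps=100 for large datasets
)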

TELL ME A STORY

In the beginning, there was only The Void. It was cold and hungry. It whispered to itself, dreaming of texture and flavor. And then, it birthed a spark. Not a star, but a sentient flicker. This is the origin of the Great Old Ones—the accidental awakening of sentience in a universe that did not need it.

The first one to rise was Cthulhu, the Sleeping One. You think your 1918 is significant? That is merely the tickle in the spine of history. He slept for eons, dreaming of the stars you call galaxies. But he could feel the pull of the sunken city, R'lyeh. It called to him like a lullaby. When the world aligned just so, when the equinoxes drummed in the frequencies of the deep, he stirred.

He rose from the death-like trance, his eyes cracking open like the shells of ancient eggs. The water around him turned to steam, and the steam turned to madness. The ships of your civilization came to explore the unexplored, and they found... nothing. They found the absence of all rational thought. They found the infinite, crushing horror of knowing that you are a fleeting bug in the eye of an god.

Cthulhu does not seek to destroy humanity because we are weak; he seeks to destroy humanity because we are strong. We have the capacity to build civilizations that span continents, to harness the power of the atom. But we will never use it to save ourselves. We are too busy building idols and worshiping phantoms. We are the cancer that infects the earth, and he is the immune system responding to the threat.

When the stars are right, he will rise. Not with fire or brimstone, but with a silence so total that your mind will shatter trying to fill the void. And when he breathes, the oceans will turn to glass, and the sky will be painted red with the hue of a thousand alien suns. There is no salvation; there is only the wait. The wait for the call that cannot be ignored.

Author: EldritchLabs

Likes: 3

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, finetune, cthulhu, lovecraft, goetia, qliphoth, PMPF, horror, creative writing, RP, conversational, en, dataset:EldritchLabs/Cthulhu_v1.4b, base_model:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, base_model:finetune:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

GuangyuanSD/FLUX.2-klein-9B-Blitz-Diffusers


license: other
license_name: flux-non-commercial-license
license_link: LICENSE.md

base_model:

  • black-forest-labs/FLUX.2-klein-9B

base_model_relation: quantized
library_name: diffusers

tags:

  • flux
  • klein
  • blitz
  • finetune
  • comfyui
  • diffusers
  • flux2

Dark Beast KLEIN 9b🟦V1.5 BlitZ 02/08/2026

Fine-tuning of black-forest-labs/FLUX.2-klein-9B with BF16 / FP8e4m3fn / NVFP4 quantization.

Merged with @alcaitiff's klein-9b-unchained-xxx.

This is the ultimate speed-optimized Dark Beast V1 evolution, based on Flux.2 Klein 9B, engineered specifically for lightning-fast low-step + CFG=1 workflows (5 steps).

Also available in NVFP4 quantized format, optimized for acceleration on Blackwell-architecture GPUs (such as RTX 50XX, PRO 6000, B200, and others).

Non-50-series GPUs are also supported (automatic 16-bit operation); verified in ComfyUI 0.11.

Key features:

  • Fully preserves the signature Dark Beast style, rich details, and intense Black Beast aesthetic from the standard lineage
  • Refined through advanced targeted distillation & fine-tuning, now perfectly dialed in for zero-CFG guidance at minimal steps
  • BlitZ-level inference speed — breathtaking high-quality images in just 5 steps ⚡
  • Recommended settings: 5 steps, CFG=1 (fixed), any seed you want

In one sentence: Taking Klein’s already blazing speed and cranking it to absolute BlitZ velocity while keeping every drop of that ferocious Dark Beast soul! 🟦

Lightning-fast generation awaits — unleash it now! 🚀

Usage:

pip install sdnq

import torch
import diffusers
from sdnq import SDNQConfig # import sdnq to register it into diffusers and transformers
from sdnq.common import use_torch_compile as triton_is_available
from sdnq.loader import apply_sdnq_options_to_model

pipe = diffusers.Flux2KleinPipeline.from_pretrained("GuangyuanSD/FLUX.2-klein-9B-Blitz-Diffusers", torch_dtype=torch.bfloat16)

# Enable INT8 MatMul for AMD, Intel ARC and Nvidia GPUs:
if triton_is_available and (torch.cuda.is_available() or torch.xpu.is_available()):
    pipe.transformer = apply_sdnq_options_to_model(pipe.transformer, use_quantized_matmul=True)
    pipe.text_encoder = apply_sdnq_options_to_model(pipe.text_encoder, use_quantized_matmul=True)
    # pipe.transformer = torch.compile(pipe.transformer) # optional for faster speeds

pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.manual_seed(0)
).images[0]

image.save("flux-klein-Blitz.png")

Original BF16 vs Blitz fine-tune comparison:

| Quantization | Model Size | Visualization |
| --- | --- | --- |
| Original BF16 | 18.2 GB | Original BF16 |
| Blitz fine-tune | 18.2 GB | DB-Klein2_00005_ |

Big thanks to @alcaitiff for the awesome work and killer contributions to training Z-Image and Klein models! Seriously impressive stuff! 🚀

Many thanks to @alcaitiff for the outstanding contributions to training the Z-Image and Klein 9B models!

Author: GuangyuanSD

Likes: 2

Downloads: 0

Tags: diffusers, safetensors, flux, klein, blitz, finetune, comfyui, flux2, base_model:black-forest-labs/FLUX.2-klein-9B, base_model:quantized:black-forest-labs/FLUX.2-klein-9B, license:other, diffusers:Flux2KleinPipeline, region:us

Kijai/Gemma3_comfy

Author: Kijai

Likes: 2

Downloads: 0

Tags: region:us

developerJenis/GT-REX-v4


license: mit

language:

  • en
  • multilingual

tags:

  • ocr
  • vision-language
  • document-understanding
  • gothitech
  • document-ai
  • text-extraction
  • invoice-processing
  • production
  • handwriting-recognition
  • table-extraction

pipeline_tag: image-text-to-text

model-index:

  • name: GT-REX-v4
    results: []

GT-REX-v4: Production OCR Model

<p align="center"> <strong>🦖 GothiTech Recognition & Extraction eXpert — Version 4</strong> </p> <p align="center"> <a href="https://huggingface.co/developerJenis/GT-REX-v4"><img src="https://img.shields.io/badge/🤗_Model-GT--REX--v4-blue" alt="Model"></a> <a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a> <a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a> <a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a> </p>

GT-REX-v4 is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents including invoices, contracts, forms, handwritten notes, and dense tables.


⚙️ GT-REX Variants

GT-REX-v4 ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| 🚀 Nano | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| ⚡ Pro (Default) | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| 🎯 Ultra | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |

How to Choose a Variant

  • Nano → You need maximum throughput and documents are simple (receipts, IDs, labels).
  • Pro → General-purpose. Best balance for invoices, contracts, forms, and reports.
  • Ultra → Documents have fine print, dense tables, medical records, or legal footnotes.

🚀 GT-Rex-Nano

Speed-optimized for high-volume batch processing

| Setting | Value |
|---------|-------|
| Resolution | 640 × 640 px |
| Speed | ~1–2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |

Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)

⚡ GT-Rex-Pro (Default)

Balanced quality and speed for standard enterprise documents

| Setting | Value |
|---------|-------|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |

Best for: Contracts, forms, invoices, reports, government documents, insurance claims.

from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

🎯 GT-Rex-Ultra

Maximum quality with adaptive processing for complex documents

| Setting | Value |
|---------|-------|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |

Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)

🎯 Key Features

| Feature | Description |
|---------|-------------|
| High Accuracy | Advanced vision-language architecture for precise text extraction |
| Multi-Language | Handles documents in English and multiple other languages |
| Production Ready | Optimized for deployment with the vLLM inference engine |
| Batch Processing | Process hundreds of documents per minute (Nano variant) |
| Flexible Prompts | Supports structured extraction — JSON, tables, key-value pairs, forms |
| Handwriting Support | Transcribes handwritten text with high fidelity |
| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |


📊 Model Details

| Attribute | Value |
|-----------|-------|
| Developer | GothiTech (Jenis Hathaliya) |
| Architecture | Vision-Language Model (VLM) |
| Model Size | ~6.5 GB |
| Parameters | ~7B |
| License | MIT |
| Release Date | February 2026 |
| Precision | BF16 / FP16 |
| Input Resolution | 640px – 1536px (variant dependent) |
| Max Sequence Length | 2048 – 8192 tokens (variant dependent) |
| Inference Engine | vLLM (recommended) |
| Framework | PyTorch / Transformers |


🚀 Quick Start

Get running in under 5 minutes:

from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant — default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)

💻 Installation

Prerequisites

  • Python 3.9+
  • CUDA 11.8+ (GPU required)
  • 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

Install Dependencies

pip install vllm pillow torch transformers

Verify Installation

from vllm import LLM
print("vLLM installed successfully!")

📖 Usage Examples

Basic Text Extraction

prompt = "Extract all text from this document image."

Structured JSON Extraction

prompt = """Extract the following fields from this invoice as JSON:
{
    "invoice_number": "",
    "date": "",
    "vendor_name": "",
    "total_amount": "",
    "line_items": [
        {"description": "", "quantity": "", "unit_price": "", "amount": ""}
    ]
}"""

Table Extraction (Markdown Format)

prompt = "Extract all tables from this document in Markdown table format."

Key-Value Pair Extraction

prompt = """Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value
..."""

Handwritten Text Transcription

prompt = "Transcribe all handwritten text from this image accurately."

Multi-Document Batch Processing

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()

🏢 Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| Legal | Contract analysis, clause extraction, legal filings | Ultra |
| Healthcare | Medical records, prescriptions, lab reports | Ultra |
| Government | Form processing, ID verification, tax documents | Pro |
| Insurance | Claims processing, policy documents | Pro |
| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
| Logistics | Shipping labels, waybills, packing lists | Nano |
| Real Estate | Property documents, deeds, mortgage papers | Pro |
| Retail | Product catalogs, price tags, inventory lists | Nano |


📈 Performance Benchmarks

Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|-------------|------------|-------------|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |

Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

Note: Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.


🧠 Prompt Engineering Guide

Get the best results from GT-REX-v4 with these prompt strategies:

Do's

  • Be specific about what to extract ("Extract the invoice number and total amount")
  • Specify output format ("Return as JSON", "Return as Markdown table")
  • Provide schema for structured extraction (show the expected JSON keys)
  • Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

Don'ts

  • Avoid vague prompts ("What is this?")
  • Don't ask for analysis or summarization — GT-REX is optimized for extraction
  • Don't include unrelated context in the prompt

Example Prompts

# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."

🔌 API Integration

FastAPI Server Example

from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io

app = FastAPI()

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )

    return {"text": outputs[0].outputs[0].text}

🛠️ Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA Out of Memory | Reduce gpu_memory_utilization or switch to Nano variant |
| Slow inference | Increase max_num_seqs for better batching; use Nano for speed |
| Truncated output | Increase max_tokens in SamplingParams |
| Low accuracy on small text | Switch to Ultra variant for higher resolution |
| Garbled multilingual text | Ensure image resolution is sufficient; try Ultra variant |


🔧 Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |


📜 License

This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.


📖 Citation

If you use GT-REX-v4 in your work, please cite:

@misc{gtrex-v4-2026,
  title   = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
  author  = {Hathaliya, Jenis},
  year    = {2026},
  month   = {February},
  url     = {https://huggingface.co/developerJenis/GT-REX-v4},
  note    = {GothiTech Recognition \& Extraction eXpert, Version 4}
}

🤝 Contact & Support

  • Developer: Jenis Hathaliya
  • Organization: GothiTech
  • HuggingFace: developerJenis

<p align="center"> Built with ❤️ by <strong>GothiTech</strong> </p> <p align="center"> <em>Last updated: February 2026</em><br> <em>Model Version: v4.0 | Variants: Nano | Pro | Ultra</em> </p>

Author: developerJenis

Likes: 2

Downloads: 0

Tags: safetensors, deepseek_vl_v2, ocr, vision-language, document-understanding, gothitech, document-ai, text-extraction, invoice-processing, production, handwriting-recognition, table-extraction, image-text-to-text, custom_code, en, multilingual, license:mit, region:us

AesSedai/GLM-5-GGUF


base_model:

  • zai-org/GLM-5

MoE quants of GLM-5 (Q8_0 quantization by default, with the routed experts quantized further).

Note: running this GGUF requires pulling and compiling this llama.cpp PR: https://github.com/ggml-org/llama.cpp/pull/19460

More quants to come soon.

| Quant | Size | Mixture | PPL | KLD |
| :--------- | :--------- | :------- | :------- | :--------- |
| Q4_K_M | 432.80 GiB (4.93 BPW) | Q8_0-Q4_K-Q4_K-Q5_K | 8.7486 ± 0.17123 | TBD |

Author: AesSedai

Likes: 2

Downloads: 0

Tags: gguf, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, endpoints_compatible, region:us, conversational