Today's AI Summary

AI Developments: Qwen3 Excels, Image Editing Gets Automated, and More

Today's AI landscape is buzzing with advancements across various domains, from large language models to image editing and healthcare applications. Here's a quick rundown of the most interesting developments:

Research Highlights

  • Temporal Causal Representation Learning: A new paper introduces CaRTeD, a framework that combines temporal causal representation learning with irregular tensor decomposition. This is particularly useful for analyzing high-dimensional, irregular data like electronic health records, improving phenotyping and network recovery.
  • Kolmogorov-Arnold Networks (KANs) for Imbalanced Data: An empirical study evaluates KANs for class-imbalanced classification. KANs perform well on raw imbalanced data compared to MLPs, but are incompatible with conventional imbalance strategies and suffer from high computational costs.
  • Autonomous Image Editing Triplet Mining: The "NoHumansRequired" paper presents an automated pipeline for mining high-quality image editing triplets. This system uses a task-tuned Gemini validator to score instruction adherence and aesthetics, enabling large-scale training without human labeling. The authors also release NHR-Edit, a dataset of 358k high-quality triplets, and Bagel-NHR-Edit, a fine-tuned Bagel model.
  • CUDA Optimization via Reinforcement Learning: CUDA-L1, a reinforcement learning framework for CUDA optimization, achieves significant speedups on CUDA kernels. Trained on NVIDIA A100, it delivers an average speedup of x17.7 across KernelBench, demonstrating excellent portability across GPU architectures.
  • LLMs for Bridge Condition Assessment: A pilot study explores the use of Large Language Models for interpreting Non-Destructive Evaluation contour maps for bridge condition assessment. The research indicates that LLMs can improve efficiency and accuracy in bridge maintenance and safety assessments.
  • Generative AI for Human Motion Simulation: G-AI-HMS integrates text-to-text and text-to-motion models to enhance the quality of human motion simulation for industrial tasks. AI-enhanced motions showed lower error than those generated from human-created descriptions in most scenarios.
  • Plain Language Adaptation of Biomedical Abstracts: The PLABA track at TREC showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
  • Deep Learning for Scoliosis Assessment: A multi-centre evaluation of a deep learning software for scoliosis assessment demonstrates that the software reproduces expert level Cobb angle measurements and categorical grading across multiple centres.
  • Emotion and Memory in Intelligent Systems: A study examines the relationship between perceived group emotions and group memorability in conversational interactions, finding that the observed relationship cannot be reliably distinguished from random chance.
  • Longitudinal Progress Note Generation: DENSE, a system designed to generate longitudinal progress notes, leverages a clinically informed retrieval strategy to identify temporally and semantically relevant content from both current and prior visits.

Model Updates

  • Qwen3-235B-A22B-Instruct-2507: Qwen has released an updated version of its non-thinking mode model, featuring significant improvements in instruction following, logical reasoning, text comprehension, and more. It also boasts enhanced capabilities in long-context understanding (256K). The model achieves top performance in various benchmarks, including knowledge, reasoning, coding, alignment, agent capabilities, and multilingualism.
  • Qwen3-235B-A22B-Instruct-2507-FP8: This is the FP8-quantized version of the Qwen3 model, offering similar enhancements with the benefits of FP8 quantization for potentially faster and more efficient inference.
  • OmniSVG: OmniSVG is a new family of multimodal SVG generators that leverage pre-trained Vision-Language Models (VLMs). It is capable of generating complex and detailed SVGs from text or images. The authors also introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets.
  • Motif Vision 6B-Preview: Motif Technologies has released a preview of its text-to-image model, trained from scratch using a MMDiT architecture and Flow Matching.

Key Takeaways

  • Qwen3's advancements: Qwen continues to push the boundaries of large language models, demonstrating impressive performance gains across a wide range of tasks.
  • Automation in image editing: The automated image editing triplet mining pipeline represents a significant step towards more efficient and scalable training of image editing models.
  • LLMs in healthcare and infrastructure: The application of LLMs to bridge condition assessment and longitudinal progress note generation highlights the potential of AI to improve efficiency and accuracy in healthcare and infrastructure management.
  • Reinforcement learning for performance optimization: CUDA-L1 shows that reinforcement learning can deliver substantial CUDA kernel speedups, extending AI-driven automation to low-level systems tuning.

AI Papers for 2026-04-29

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Adaptive programming practice often relies on fixed libraries of worked examples and practice problems, which require substantial authoring effort and may not correspond well to the logical errors and partial solutions students produce while writing code. As a result, students may receive learning content that does not directly address the concepts they are working to understand, while instructors must either invest additional effort in expanding content libraries or accept a coarse level of personalization. We present an approach for knowledge-component (KC) guided educational content generation using pattern-based KCs extracted from student code. Given a problem statement and student submissions, our pipeline extracts recurring structural KC patterns from students' code through AST-based analysis and uses them to condition a generative model. In this study, we apply this approach to worked example generation, and compare baseline and KC-conditioned outputs through expert evaluation. Results suggest that KC-conditioned generation improves topical focus and relevance to learners' underlying logical errors, providing evidence that KC-based steering of generative models can support personalized learning at scale.
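As a rough illustration of the AST-based pattern-extraction step described in the abstract, the sketch below (hypothetical helper names such as extract_kc_patterns are ours, not the paper's) counts recurring structural patterns across student submissions with Python's ast module.

import ast
from collections import Counter

def structural_signature(node: ast.AST, depth: int = 2) -> str:
    # Shallow structural signature: node type plus child node types up to `depth`
    children = list(ast.iter_child_nodes(node))
    if depth == 0 or not children:
        return type(node).__name__
    inner = ",".join(structural_signature(c, depth - 1) for c in children)
    return f"{type(node).__name__}({inner})"

def extract_kc_patterns(submissions, top_k=10):
    # Count recurring control-flow and function patterns across a set of student submissions
    counts = Counter()
    for code in submissions:
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # partial solutions may not parse; skip them in this toy version
        for node in ast.walk(tree):
            if isinstance(node, (ast.For, ast.While, ast.If, ast.FunctionDef)):
                counts[structural_signature(node)] += 1
    return counts.most_common(top_k)

subs = [
    "total = 0\nfor x in xs:\n    if x > 0:\n        total += x",
    "evens = []\nfor item in data:\n    if item % 2 == 0:\n        evens.append(item)",
]
print(extract_kc_patterns(subs))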

Learning to Think from Multiple Thinkers

We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.

Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

Every Transformer architecture dedicates enormous capacity to learning rich representations in semantic embedding space -- yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) has been treated as a fixed, hand-crafted structure, populated only by discrete ordinal indices. We argue that this rotation space is a largely overlooked second dimension of expressivity in the attention mechanism, one whose systematic exploration may open a new door for attention-based architectures. The analogy to complex numbers is instructive: just as introducing the imaginary axis -- orthogonal to and independent of the real line -- unlocked new algebraic structure once believed impossible, treating the rotation manifold as a learnable, signal-conditioned space opens an orthogonal degree of freedom in attention. In this framing, the token embedding encodes the semantic (real) component of a representation -- what a token means -- while the rotation encodes its dynamic (imaginary) component -- how it relates to every other token across time, position, and context. We introduce SIREN-RoPE, a concrete instantiation of this idea, which populates the rotation dimension with heterogeneous signals -- continuous timestamps, cyclical temporal patterns, and categorical metadata -- via a dual-branch Sinusoidal Representation Network (SIREN). As a proof of concept, we evaluate on a production-scale news feed dataset from a major social network using a generative recommender as the ranking model, demonstrating that activating this hidden dimension yields consistent improvements across calibration and ranking objectives with negligible computational overhead. We invite the community to view the rotation space not as a solved positional-encoding detail, but as an untapped axis whose rich structure may prove as consequential for attention as the imaginary unit proved for algebra.
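To make the idea of signal-conditioned rotations concrete, here is a minimal PyTorch sketch (our own simplification, not the paper's SIREN-RoPE code): a small sine-activated network maps per-token signals such as timestamps to rotation angles, which are then applied as a standard RoPE-style pairwise rotation of query/key features.

import torch
import torch.nn as nn

class SignalToAngles(nn.Module):
    # Tiny sine-activated MLP mapping per-token signals to head_dim/2 rotation angles
    def __init__(self, signal_dim: int, head_dim: int, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(signal_dim, hidden)
        self.fc2 = nn.Linear(hidden, head_dim // 2)

    def forward(self, signals: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.sin(self.fc1(signals)))  # sine activation, SIREN-style

def apply_rotation(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # Rotate consecutive feature pairs of x by the given per-token angles (RoPE-style)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# Example: batch of 2 sequences, length 5, head_dim 8, conditioning on 3 per-token signals
q = torch.randn(2, 5, 8)
signals = torch.randn(2, 5, 3)  # e.g. timestamp, hour-of-day, categorical metadata score
q_rot = apply_rotation(q, SignalToAngles(3, 8)(signals))
print(q_rot.shape)  # torch.Size([2, 5, 8])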

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.
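For context, Kendall's tau measures how consistently two raters rank the same set of outputs; an agreement check of the kind reported above can be reproduced with scipy (the rankings below are purely illustrative, not the study's data).

from scipy.stats import kendalltau

# Hypothetical rankings of seven agent versions by a clinician rubric and an LLM rubric
clinician_rank = [1, 2, 3, 4, 5, 6, 7]
llm_rank = [1, 3, 2, 4, 6, 5, 7]

tau, p_value = kendalltau(clinician_rank, llm_rank)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")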

Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture -- such as dropout rate, attention scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.
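A minimal sketch of the fan-out/converge protocol, assuming a torch.distributed data-parallel job (helper names like converge_average are ours, not the paper's): each replica trains for T steps with its own point in a symmetric learning-rate spread, then parameters are averaged with an AllReduce.

import torch
import torch.distributed as dist

def converge_average(model: torch.nn.Module) -> None:
    # Converge phase: average parameters across all replicas via AllReduce
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size

def fan_out_lr(base_lr: float, spread: float = 0.2) -> float:
    # Fan-out phase: give each rank a learning rate from a symmetric spread around base_lr
    rank, world = dist.get_rank(), dist.get_world_size()
    return base_lr * (1.0 + spread * (2.0 * rank / max(world - 1, 1) - 1.0))

# Training loop skeleton (T = steps between converge phases):
# for step, (x, y) in enumerate(loader):
#     for group in optimizer.param_groups:
#         group["lr"] = fan_out_lr(base_lr=3e-4)
#     loss = loss_fn(model(x), y)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
#     if (step + 1) % T == 0:
#         converge_average(model)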

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects, Lexical Vagueness, Under-Specification and Syntax-Formatting on 3 benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs in task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents

Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the \textbf{Informational Viability Principle}: governing an agent reduces to estimating a bound on unobserved risk $\hat{B}(x) = U(x) + SB(x) + RG(x)$ and allowing an action only when its capacity $S(x)$ exceeds $\hat{B}(x)$ by a safety margin. The \textbf{Agent Viability Framework}, grounded in Aubin's viability theory, establishes three properties -- monitoring (P1), anticipation (P2), and monotonic restriction (P3) -- as individually necessary and collectively sufficient for documented failure modes. \textbf{RiskGate} instantiates the framework with dedicated statistical estimators (KL divergence, segment-vs-rest $z$-tests, sequential pattern matching), a fail-secure monotonic pipeline, and a closed-loop Autopilot formalised as an instance of Aubin's regulation map with kill-switch-as-last-resort; a scalar Viability Index $VI(t) \in [-1,+1]$ with first-order $t^*$ prediction transforms governance from reactive to predictive. Contributions are the theoretical framework, the reference implementation, and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is scoped as follow-up work.
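As a toy illustration of the gating rule (our own sketch, not the RiskGate implementation), an action is allowed only when its capacity S(x) exceeds the estimated bound B_hat(x) = U(x) + SB(x) + RG(x) by a safety margin.

from dataclasses import dataclass

@dataclass
class RiskEstimate:
    u: float   # U(x) term of the bound
    sb: float  # SB(x) term
    rg: float  # RG(x) term

def allow_action(capacity: float, est: RiskEstimate, margin: float = 0.1) -> bool:
    # Allow only if S(x) >= B_hat(x) + margin; block (fail-secure) otherwise
    b_hat = est.u + est.sb + est.rg
    return capacity >= b_hat + margin

print(allow_action(capacity=0.9, est=RiskEstimate(0.3, 0.2, 0.1)))  # True
print(allow_action(capacity=0.6, est=RiskEstimate(0.3, 0.2, 0.1)))  # False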

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
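The structured, path-preserving JSON representation can be pictured with a small sketch (our own illustration, not BMW's pipeline): every file under the project root is keyed by its relative path, so one model response can create or modify several files at once.

import json
from pathlib import Path

def repo_to_json(root: str) -> str:
    # Encode a DSL project folder as {relative_path: file_content}
    files = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            files[str(path.relative_to(root))] = path.read_text(encoding="utf-8")
    return json.dumps(files, indent=2)

def json_to_repo(payload: str, out_root: str) -> None:
    # Materialize a generated multi-file change back into a folder tree
    for rel_path, content in json.loads(payload).items():
        target = Path(out_root) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")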

AI Models

inclusionAI/Ling-2.6-flash


license: mit language:

  • en

Ling-2.6-flash: Faster Responses, Stronger Execution, Higher Token Efficiency

Introduction

Today, we announce the official open-source release of Ling-2.6-flash, an instruct model with 104B total parameters and 7.4B active parameters.

As agent capabilities mature, skyrocketing token consumption has become a primary barrier to deployment. Unlike standard chat, agent workflows involve massive inputs and complex, multi-step execution, driving up both compute demand and user costs. While the industry is pivoting toward "long-reasoning" to push performance ceilings, a critical question remains: Are these excessive reasoning tokens truly necessary for high-frequency, everyday agent use cases?

Faced with mounting token pressure, Ling-2.6-flash takes a different path. Rather than relying on longer outputs to chase higher scores, it is systematically optimized for inference efficiency, token efficiency, and agent performance—aiming to stay highly competitive while being faster, leaner, and better suited for real production workloads.

At a high level, Ling-2.6-flash is built around three core strengths:

  • Hybrid linear architecture for higher inference efficiency.
    By introducing a hybrid linear architecture, we improve computational efficiency at the foundation level. On a 4× H20 setup, Ling-2.6-flash reaches inference speeds of up to 340 tokens/s. In other words, it completes tasks with significantly better cost-performance efficiency.
  • Token-efficiency optimization for a better intelligence-efficiency tradeoff.
    During training, we specifically optimized for token efficiency, with the goal of accomplishing tasks using more concise outputs. On the full Artificial Analysis evaluation suite, Ling-2.6-flash uses only 15M tokens while still delivering competitive performance. This translates into a meaningfully stronger intelligence-efficiency profile.
  • Targeted improvements for agent scenarios.
    For the agent use cases seeing the strongest demand today, we continuously refined Ling-2.6-flash in tool use, multi-step planning, and task execution. As a result, the model achieves performance that is competitive with, and in some cases reaches SOTA level against, models with larger active parameter counts on benchmarks including BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, and PinchBench.

Evaluation

We have conducted a comprehensive evaluation of Ling-2.6-flash across multiple authoritative benchmarks. Ling-2.6-flash performs strongly on representative agent benchmarks such as BFCL-V4, TAU2-bench, SWE-bench Verified, and PinchBench. In practice, Ling-2.6-flash delivers a strong user experience across frameworks including Claude Code, Kilo Code, Qwen Code, Hermes Agent, and OpenClaw.

Beyond agent tasks, Ling-2.6-flash also delivers strong performance across general knowledge, mathematical reasoning, instruction following, and long-context understanding, remaining well aligned with SOTA models in the same size class.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/KhFxSrxyF5IAAAAAgCAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="u4a7a4034" class="ne-image"> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/4bI1SK8pNM8AAAAAgBAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="uc95688f2" class="ne-image"> </div>
  • <font style="color:rgb(38, 38, 38);">PinchBench</font><font style="color:rgb(38, 38, 38);">: Comparative scores are retrieved directly from the official PinchBench leaderboard (as of April 20, 2026), adhering to their evaluation modes (potentially Reasoning Mode). </font>
  • <font style="color:rgb(38, 38, 38);">Claw-Eval</font><font style="color:rgb(38, 38, 38);">: Comparative scores are sourced from the official Claw-Eval leaderboard (version dated 2026-03-25), adhering to their evaluation modes (potentially Reasoning Mode). Official scores for GPT-OSS-120B and GPT-5.4-mini are currently unavailable and have been omitted.</font>
  • <font style="color:rgb(38, 38, 38);">TAU2-Bench</font><font style="color:rgb(38, 38, 38);">: Evaluations are conducted using official v1.0.0 code and datasets. Following the GLM-5 evaluation protocol, we applied minor prompt adjustments in the Retail and Telecom domains to ensure users express requests clearly and to prevent premature session termination. Additionally, GPT-5.2 was utilized as the User Agent across all evaluated domains.</font>
  • <font style="color:rgb(38, 38, 38);">IFBench</font><font style="color:rgb(38, 38, 38);">: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA (Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.</font>

Architecture

Ling-2.6-flash continues the architectural direction introduced in Ling 2.5. Building on the Ling 2.0 foundation, we incorporate a hybrid linear attention mechanism, upgrading the original GQA attention design into a 1:7 MLA + Lightning Linear hybrid architecture through incremental training.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/dZ9VS4RPjzAAAAAAgBAAAAgADryCAQFr/fmt.webp" width="650" title="" crop="0,0,1,1" id="u46a87a11" class="ne-image"> </div>

This combination of hybrid attention and a highly sparse MoE architecture gives Ling-2.6-flash a clear advantage in inference efficiency. Compared with mainstream SOTA models in a similar size class, Ling-2.6-flash not only delivers faster time-to-first-token, but also achieves substantially higher generation throughput in long-output scenarios. At peak, both prefill throughput and decode throughput can improve by up to around 4×.

As shown in the figure below, Ling-2.6-flash’s throughput advantage becomes more pronounced as both context length and generation length increase. More importantly, this is not just a benchmark-side gain on static metrics. In real deployment settings, the model continues to unlock stronger speed benefits as task complexity grows.

Whether the workload involves long-context understanding or extended text generation, Ling-2.6-flash preserves model capability while delivering faster responses, higher throughput, and better real-world deployment efficiency.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/Fa_fQrVD3hcAAAAAX7AAAAgADryCAQFr/original" width="600" alt="Decode Throughput Comparison"> <p><em>Decode Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/LRDBTILYEooAAAAAXdAAAAgADryCAQFr/original" width="600" alt="Prefill Throughput Comparison"> <p><em>Prefill Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div>

Quickstart

SGLang (Recommended)

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

# uv pip "sglang-kernel>=0.4.1"
uv pip install "sglang[all]>=0.5.10.post1" --prerelease=allow
Run Inference

SGLang now supports both the BF16 and FP8 models; the variant used is determined by the dtype of the checkpoint in ${MODEL_PATH}. Here is an example of running Ling-2.6-flash on 4 GPUs, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

Server

1. Standard Inference (Without MTP)

python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 4 \
    --pp-size 1 \
    --dp-size 1 \
    --trust-remote-code \
    --context-length 262144 \
    --tool-call-parser qwen25 \
    --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}' \
    --dist-init-addr $MASTER_IP:2345 \
    --port $PORT \
    --nnodes 1

2. Inference with MTP (Multi-Token Prediction)
The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly.

Install our SGLang

git clone -b ling_2_6 git@github.com:antgroup/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python"

Start server

python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 4 \
    --pp-size 1 \
    --dp-size 1 \
    --context-length 262144 \
    --mamba-scheduler-strategy extra_buffer \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --max-running-requests 64 \
    --max-mamba-cache-size 256 \
    --tool-call-parser qwen25 \
    --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}' \
    --trust-remote-code \
    --dist-init-addr $MASTER_IP:2345 \
    --port $PORT \
    --nnodes 1

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

vLLM

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

git clone https://github.com/vllm-project/vllm.git

cd vllm

VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto

Run inference

Server

vllm serve $MODEL_PATH \
    --port $PORT \
    --served-model-name my_model \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Limitations & Future Plans

Ling-2.6-flash has already made meaningful progress in our pursuit of an extreme intelligence-efficiency tradeoff. The model has improved substantially in key areas such as tool use, multi-step planning, and long-horizon task execution. Combined with systematic optimizations in inference efficiency and interaction experience, Ling-2.6-flash is now better equipped to handle large-scale, high-frequency automated workloads, delivering stronger real-world value in production settings.

At the same time, we are fully aware that pushing intelligence efficiency to the limit comes with tradeoffs. In some highly complex scenarios, the model can still exhibit tool hallucinations due to limited reasoning depth. In addition, there is still room for improvement in areas such as natural bilingual switching between Chinese and English and compliance with highly complex instructions.

Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between output quality and token efficiency, and to continuously strengthen the model’s stability, usability, and interaction experience across a wider range of real-world scenarios.

Author: inclusionAI

Likes: 45

Downloads: 29

Tags: safetensors, bailing_hybrid, custom_code, en, license:mit, eval-results, region:us

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS


license: apache-2.0 base_model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 language:

  • en
  • zh
  • multilingual library_name: transformers pipeline_tag: text-generation tags:
  • abliterated
  • uncensored
  • qwen3
  • qwen3.6
  • nvfp4
  • modelopt
  • mtp
  • multi-token-prediction
  • speculative-decoding
  • hybrid-attention
  • mamba
  • gated-deltanet
  • multimodal
  • aeon
  • rtx-5090
  • rtx-pro-6000
  • b100
  • b200
  • dedicated-vram-blackwell
  • sm_120
  • sm_100
  • 32gb
  • conv1d-preserved

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS

Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and AGENTS.md — an operator's manual that pre-empts common stale-documentation traps.

🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config — including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving linear_attn.conv1d at BF16 — and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.

What "XS" means — and what it's not

This is the extra-small footprint sibling of -Multimodal-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).

| | Multimodal-NVFP4-MTP (regular) | Multimodal-NVFP4-MTP-XS (this repo) |
|---|---|---|
| linear_attn projections (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| linear_attn.conv1d (SSM 1D convolution — recurrence-critical) | preserved BF16 | preserved BF16 ✅ |
| linear_attn SSM state vectors (A_log, dt_bias, norm.weight) | preserved BF16 | preserved BF16 ✅ |
| mtp.* head (grafted bf16 from base, bit-exact verified) | yes | yes |
| Vision tower | preserved BF16 | preserved BF16 |
| Total disk | ~27 GB | ~21 GB |
| VRAM footprint at runtime | ~28 GB | ~22 GB |

This is a smart, strategic quantization — not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).

When to pick which:

  • Pick the regular variant if you have ≥48 GB VRAM. Even the projection weights at BF16 give a small additional safety margin on long-context recurrence stability.
  • Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

Variants

| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark — DFlash spec decode, validated |
| Multimodal-NVFP4-MTP | 27 GB | RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16 |
| Text-NVFP4-MTP | 26 GB | Same as above without vision tower |
| Multimodal-NVFP4-MTP-XS (this repo) | 21 GB | RTX 5090 / smaller dedicated VRAM — MTP, full FP4 incl. GDN projections |
| Text-NVFP4-MTP-XS | 20 GB | Same as this repo without vision tower |

What this is

The modelopt-format NVFP4 + MTP variant, multimodal-preserved, with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

  • Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. modelopt format, served by vLLM through --quantization modelopt.
  • Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only linear_attn.conv1d is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
  • Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
  • MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16, bit-exact verified). Powers --speculative-config '{"method":"qwen3_5_mtp",...}' for self-speculative decoding without a separate drafter.

Why MTP

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

  • Single-stream short prompts at n=3: ~132 tok/s
  • Single-stream long-form: ~105 tok/s
  • 2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
  • Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant will land in the GitHub repo once measured.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | Either: -NVFP4 (DFlash) (simpler, validated) or this XS body served with --speculative-config '{"method":"dflash",...}' (highest measured throughput — see note below) | Spark prefers DFlash regardless of body. The XS body with DFlash spec lands at 37.6 tok/s median, 68.7 tok/s peak on Spark — the highest measured config. The grafted MTP head in this repo is unused in that path. Never use --speculative-config '{"method":"qwen3_5_mtp",...}' on Spark — that lands at only 24.1 tok/s median. |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | Multimodal-NVFP4-MTP — GDN BF16 for best long-context fidelity, or this XS variant for ~10 % faster decode | XS measured 111.4 tok/s median vs regular's 101.5 on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| B100 / B200 (sm_100, dedicated FP4) | Multimodal-NVFP4-MTP (preferred — GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | This XS variant ✅ if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB; matches sakamakismile's reference footprint. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values let the draft drift further from the target distribution, and acceptance falls.

vLLM serve — DGX Spark (DFlash spec, not MTP — measured winning config)

For DGX Spark, swap the spec method to DFlash. The XS body still benefits from FP4 silicon, but DFlash's k=15 chains are decisively better than MTP's n=3 on unified memory.

# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 200000 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --attention-backend flash_attn \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":15}'

Production-validated v2.1 image: ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v2.1. Measured 37.6 tok/s median, 68.7 tok/s peak in this config — the highest single-stream config we've measured on Spark.

Configuration notes

  • --quantization modelopt is required for this body (not compressed-tensors — different format).
  • --speculative-config '{"method":"qwen3_5_mtp", ...}' uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
  • --speculative-config '{"method":"dflash", ...}' uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200 — they prefer MTP.
  • --gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.85 is the cap on DGX Spark (unified memory thrashes higher).

Quantization recipe

  • Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
  • Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
  • Excluded from quantization (kept BF16) — XS variant differences from the regular variant in bold:
    • lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
    • *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG default — kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence; this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)
    • *linear_attn* is NOT broadly excluded (XS difference — the projection matmuls in_proj_qkv, in_proj_z, in_proj_a/b, out_proj get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
    • *visual* (vision tower preservation)
    • *mtp* (MTP head preservation)
    • *output_layer*, output.*
  • MTP graft: 15 tensors copied bf16 from Qwen/Qwen3.6-27B after modelopt export
  • Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
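For reference, an exclusion list of this kind is typically expressed through the modelopt quantization config. The sketch below is our illustrative reconstruction (patterns mirror the list above; the calibration loop is elided), not the exact script used to build this repo.

import copy
import modelopt.torch.quantization as mtq

# Start from the standard NVFP4 recipe and disable quantization for the preserved modules
config = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in [
    "*lm_head*", "*router*", "*mlp.gate*",        # default exclusions
    "*linear_attn.conv1d*", "*mixer.conv1d*",     # recurrence-critical SSM conv kept BF16
    "*visual*",                                   # vision tower preserved
    "*mtp*",                                      # grafted MTP head preserved
]:
    config["quant_cfg"][pattern] = {"enable": False}

# model = ...            # load the BF16 AEON-Ultimate checkpoint
# def forward_loop(m):   # run ~20 calibration samples of 8192 tokens through m
#     ...
# mtq.quantize(model, config, forward_loop)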

Provenance & credits

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

Author: AEON-7

Likes: 8

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, abliterated, uncensored, qwen3, qwen3.6, nvfp4, modelopt, mtp, multi-token-prediction, speculative-decoding, hybrid-attention, mamba, gated-deltanet, multimodal, aeon, rtx-5090, rtx-pro-6000, b100, b200, dedicated-vram-blackwell, sm_120, sm_100, 32gb, conv1d-preserved, text-generation, conversational, en, zh, multilingual, base_model:AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16, base_model:quantized:AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16, license:apache-2.0, endpoints_compatible, 8-bit, region:us

inclusionAI/Ling-2.6-flash-int4


license: mit language:

  • en

Ling-2.6-flash: Faster Responses, Stronger Execution, Higher Token Efficiency

Introduction

Today, we announce the official open-source release of Ling-2.6-flash, an instruct model with 104B total parameters and 7.4B active parameters.

As agent capabilities mature, skyrocketing token consumption has become a primary barrier to deployment. Unlike standard chat, agent workflows involve massive inputs and complex, multi-step execution, driving up both compute demand and user costs. While the industry is pivoting toward "long-reasoning" to push performance ceilings, a critical question remains: Are these excessive reasoning tokens truly necessary for high-frequency, everyday agent use cases?

Faced with mounting token pressure, Ling-2.6-flash takes a different path. Rather than relying on longer outputs to chase higher scores, it is systematically optimized for inference efficiency, token efficiency, and agent performance—aiming to stay highly competitive while being faster, leaner, and better suited for real production workloads.

At a high level, Ling-2.6-flash is built around three core strengths:

  • Hybrid linear architecture for higher inference efficiency.
    By introducing a hybrid linear architecture, we improve computational efficiency at the foundation level. On a 4× H20 setup, Ling-2.6-flash reaches inference speeds of up to 340 tokens/s. In other words, it completes tasks with significantly better cost-performance efficiency.
  • Token-efficiency optimization for a better intelligence-efficiency tradeoff.
    During training, we specifically optimized for token efficiency, with the goal of accomplishing tasks using more concise outputs. On the full Artificial Analysis evaluation suite, Ling-2.6-flash uses only 15M tokens while still delivering competitive performance. This translates into a meaningfully stronger intelligence-efficiency profile.
  • Targeted improvements for agent scenarios.
    For the agent use cases seeing the strongest demand today, we continuously refined Ling-2.6-flash in tool use, multi-step planning, and task execution. As a result, the model achieves performance that is competitive with, and in some cases reaches SOTA level against, models with larger active parameter counts on benchmarks including BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, and PinchBench.

Evaluation

We have conducted a comprehensive evaluation of Ling-2.6-flash across multiple authoritative benchmarks. Ling-2.6-flash performs strongly on representative agent benchmarks such as BFCL-V4, TAU2-bench, SWE-bench Verified, and PinchBench. In practice, Ling-2.6-flash delivers a strong user experience across frameworks including Claude Code, Kilo Code, Qwen Code, Hermes Agent, and OpenClaw.

Beyond agent tasks, Ling-2.6-flash also delivers strong performance across general knowledge, mathematical reasoning, instruction following, and long-context understanding, remaining well aligned with SOTA models in the same size class.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/KhFxSrxyF5IAAAAAgCAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="u4a7a4034" class="ne-image"> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/4bI1SK8pNM8AAAAAgBAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="uc95688f2" class="ne-image"> </div>
  • <font style="color:rgb(38, 38, 38);">PinchBench</font><font style="color:rgb(38, 38, 38);">: Comparative scores are retrieved directly from the official PinchBench leaderboard (as of April 20, 2026), adhering to their evaluation modes (potentially Reasoning Mode). </font>
  • <font style="color:rgb(38, 38, 38);">Claw-Eval</font><font style="color:rgb(38, 38, 38);">: Comparative scores are sourced from the official Claw-Eval leaderboard (version dated 2026-03-25), adhering to their evaluation modes (potentially Reasoning Mode). Official scores for GPT-OSS-120B and GPT-5.4-mini are currently unavailable and have been omitted.</font>
  • <font style="color:rgb(38, 38, 38);">TAU2-Bench</font><font style="color:rgb(38, 38, 38);">: Evaluations are conducted using official v1.0.0 code and datasets. Following the GLM-5 evaluation protocol, we applied minor prompt adjustments in the Retail and Telecom domains to ensure users express requests clearly and to prevent premature session termination. Additionally, GPT-5.2 was utilized as the User Agent across all evaluated domains.</font>
  • <font style="color:rgb(38, 38, 38);">IFBench</font><font style="color:rgb(38, 38, 38);">: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA (Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.</font>

Quantization Robustness: FP8 and INT4

We evaluate the FP8 and INT4 quantized models on several datasets. The FP8 and INT4 quantizations are applied via blockwise quantization and groupwise quantization, respectively.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/8QEoRqtZhAcAAAAAQaAAAAgADryCAQFr/original" width="800" title="" crop="0,0,1,1" id="uc95688f2" class="ne-image"> </div>

Architecture

Ling-2.6-flash continues the architectural direction introduced in Ling 2.5. Building on the Ling 2.0 foundation, we incorporate a hybrid linear attention mechanism, upgrading the original GQA attention design into a 1:7 MLA + Lightning Linear hybrid architecture through incremental training.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/dZ9VS4RPjzAAAAAAgBAAAAgADryCAQFr/fmt.webp" width="650" title="" crop="0,0,1,1" id="u46a87a11" class="ne-image"> </div>

This combination of hybrid attention and a highly sparse MoE architecture gives Ling-2.6-flash a clear advantage in inference efficiency. Compared with mainstream SOTA models in a similar size class, Ling-2.6-flash not only delivers faster time-to-first-token, but also achieves substantially higher generation throughput in long-output scenarios. At peak, both prefill throughput and decode throughput can improve by up to around 4×.

As shown in the figure below, Ling-2.6-flash’s throughput advantage becomes more pronounced as both context length and generation length increase. More importantly, this is not just a benchmark-side gain on static metrics. In real deployment settings, the model continues to unlock stronger speed benefits as task complexity grows.

Whether the workload involves long-context understanding or extended text generation, Ling-2.6-flash preserves model capability while delivering faster responses, higher throughput, and better real-world deployment efficiency.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/Fa_fQrVD3hcAAAAAX7AAAAgADryCAQFr/original" width="600" alt="Decode Throughput Comparison"> <p><em>Decode Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/LRDBTILYEooAAAAAXdAAAAgADryCAQFr/original" width="600" alt="Prefill Throughput Comparison"> <p><em>Prefill Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div>

Quickstart

SGLang (Recommended)

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

# uv pip "sglang-kernel>=0.4.1"
uv pip install "sglang[all]>=0.5.10.post1" --prerelease=allow
Run Inference

SGLang now supports both the BF16 and FP8 models; the variant used is determined by the dtype of the checkpoint in ${MODEL_PATH}. Here is an example of running Ling-2.6-flash on 4 GPUs, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

Server

python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 4 \
    --pp-size 1 \
    --dp-size 1 \
    --trust-remote-code \
    --context-length 262144 \
    --tool-call-parser qwen25 \
    --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}' \
    --dist-init-addr $MASTER_IP:2345 \
    --port $PORT \
    --nnodes 1

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

vLLM

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

git clone https://github.com/vllm-project/vllm.git

cd vllm

VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto

Run inference

Server

vllm serve $MODEL_PATH \
    --port $PORT \
    --served-model-name my_model \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Limitations & Future Plans

Ling-2.6-flash has already made meaningful progress in our pursuit of an extreme intelligence-efficiency tradeoff. The model has improved substantially in key areas such as tool use, multi-step planning, and long-horizon task execution. Combined with systematic optimizations in inference efficiency and interaction experience, Ling-2.6-flash is now better equipped to handle large-scale, high-frequency automated workloads, delivering stronger real-world value in production settings.

At the same time, we are fully aware that pushing intelligence efficiency to the limit comes with tradeoffs. In some highly complex scenarios, the model can still exhibit tool hallucinations due to limited reasoning depth. In addition, there is still room for improvement in areas such as natural bilingual switching between Chinese and English and compliance with highly complex instructions.

Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between output quality and token efficiency, and to continuously strengthen the model’s stability, usability, and interaction experience across a wider range of real-world scenarios.

Author: inclusionAI

Likes: 6

Downloads: 0

Tags: safetensors, bailing_hybrid, custom_code, en, license:mit, compressed-tensors, region:us

oumoumad/LTX-2.3-22b-IC-LoRA-MotionDeblur

LTX-2.3 22B IC-LoRA MotionDeblur

This is a MotionDeblur IC-LoRA trained on top of LTX-2.3-22b, designed to reduce (or, when possible, remove) motion blur from input video — the per-frame smearing introduced by long shutter speeds, fast subject motion, or aggressive in-camera/post motion-blur effects. It uses the blurred clip as the conditioning input and reconstructs a sharper version of the same shot. Results vary with the severity of the original blur; heavy blur is typically softened rather than fully eliminated.

It is based on the LTX-2.3 foundation model.

Model Files

ltx-2.3-22b-ic-lora-motiondeblur.safetensors

Model Details

  • Base Model: LTX-2.3-22b
  • Training Type: IC LoRA
  • Purpose: Reduce or remove motion blur from footage
  • Training Steps: 5000

🔌 Using in ComfyUI

  1. Copy the LoRA weights into models/loras.
  2. Use the IC-LoRA workflow from the LTX-2 ComfyUI repository.
  3. Load the LoRA using the LTXICLoRALoaderModelOnly node.

License

See the LTX-2-community-license for full terms.

Author: oumoumad

Likes: 5

Downloads: 0

Tags: region:us

lewtun/talkie-1930-13b-it-hf


library_name: transformers
language:
  • en
base_model:
  • talkie-lm/talkie-1930-13b-it
license: apache-2.0
tags:
  • talkie
  • vintage
  • historical
  • conversational

talkie-1930-13b-it (Transformers format)

This is a conversion of talkie-lm/talkie-1930-13b-it to the HuggingFace Transformers format. The original model was distributed as a raw PyTorch checkpoint with a custom inference library; this version can be loaded directly with AutoModelForCausalLM and AutoTokenizer.

The weights are numerically identical to the original — top-5 decoded tokens match across all test prompts, with max logit differences below 0.07 (bf16 rounding).

[!NOTE] This model was converted automatically by Hugging Face's ML Intern — an AI agent for ML engineering tasks. Try it yourself via the CLI or the Demo.

Table of Contents

  1. Model Summary
  2. How to Use
  3. Architecture Details
  4. Conversion Notes
  5. License

Model Summary

talkie-1930-13b-it is a 13B-parameter instruction-tuned language model from the talkie family, developed by Alec Radford, Nick Levine, and David Duvenaud. It was pretrained on 260B tokens of pre-1931 English-language text and instruction-tuned using a novel dataset extracted from vintage reference works — etiquette manuals, encyclopedias, letter-writing guides, and poetry collections. The model underwent reinforcement learning via online DPO with an LLM-as-a-judge to improve instruction following.

Read more in the talkie report.

Key Features

  • Vintage knowledge: trained exclusively on pre-1931 text, offering a unique window into early 20th-century language and thought
  • Instruction-tuned: fine-tuned for conversational use with a simple chat template
  • 13B parameters in bfloat16 (~26 GB VRAM)
  • 2048 token context window

How to Use

Installation

This model uses custom modeling code. Make sure you have a recent version of transformers installed:

pip install -U transformers torch

Basic Generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lewtun/talkie-1930-13b-it-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype="bfloat16",
).to("cuda")

prompt = "Write an essay predicting what life will be like in the year 1960."
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

Multi-turn Chat

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lewtun/talkie-1930-13b-it-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype="bfloat16",
).to("cuda")

messages = [
    {"role": "user", "content": "What were the causes of the French Revolution?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
reply = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(reply)

# Continue the conversation
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Which of those causes was the most significant?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

Chat Template

The model uses the following chat format:

<|system|>{system_message}<|end|><|user|>{user_message}<|end|><|assistant|>{assistant_message}<|end|>

This is applied automatically when using tokenizer.apply_chat_template().
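To inspect the rendered format directly, you can ask the tokenizer for the templated string without tokenizing. A small sketch (the example messages are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lewtun/talkie-1930-13b-it-hf", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Good day! How does one address a duchess?"},
]

# tokenize=False returns the raw templated string, i.e. the
# <|system|>...<|end|><|user|>...<|end|><|assistant|> format shown above.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))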

Architecture Details

talkie is a 40-layer decoder-only GPT with several distinctive architectural choices:

| Component | Details |
|-----------|---------|
| Parameters | 13B |
| Layers | 40 |
| Attention heads | 40 (MHA, no GQA) |
| Hidden size | 5120 |
| Head dimension | 128 |
| Intermediate size (MLP) | 13696 |
| Position encoding | RoPE (θ = 1,000,000) |
| Activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Context length | 2048 |
| Vocabulary | 65,540 (65,535 BPE + 5 special tokens) |
| Precision | bfloat16 |

Notable architectural features:

  • QK-normalization: RMSNorm is applied to queries and keys after RoPE
  • Per-head gain: learnable scalar gain per attention head, applied to queries
  • Embedding skip connections: each transformer block receives a residual connection from the (normalized) input embeddings
  • Activation gains: learnable scalar gains on attention and MLP residual streams (initialized to (2·L)^(-0.5))
  • lm_head weight gain: a learnable scalar applied to the output projection weights
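For readers unfamiliar with the first two features, the toy PyTorch sketch below illustrates QK-normalization plus a learnable per-head query gain. It is not the model's actual implementation; the shapes are examples and the RoPE step is assumed to have already been applied to the inputs.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight

class QKNormWithPerHeadGain(nn.Module):
    """Toy illustration: RMSNorm on queries and keys (post-RoPE, RoPE omitted here),
    plus a learnable scalar gain per attention head applied to the queries."""

    def __init__(self, num_heads: int = 40, head_dim: int = 128):
        super().__init__()
        self.q_norm = RMSNorm(head_dim)
        self.k_norm = RMSNorm(head_dim)
        # One learnable scalar per head; broadcasts over (batch, heads, seq, head_dim).
        self.head_gain = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, q, k):
        # q, k: (batch, num_heads, seq_len, head_dim), assumed already rotated by RoPE.
        q = self.q_norm(q) * self.head_gain
        k = self.k_norm(k)
        return q, k

q = torch.randn(1, 40, 8, 128)
k = torch.randn(1, 40, 8, 128)
q_out, k_out = QKNormWithPerHeadGain()(q, k)
print(q_out.shape, k_out.shape)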

Conversion Notes

This model was converted from the original talkie-lm/talkie-1930-13b-it PyTorch checkpoint using the reference talkie codebase as ground truth. The conversion involved:

  1. Model weights: the .pt state dict was remapped to a PreTrainedModel subclass (TalkieForCausalLM) and saved as safetensors
  2. Tokenizer: the tiktoken BPE vocabulary was converted to a PreTrainedTokenizerFast with the HuggingFace TikTokenConverter, including all 5 special tokens (<|endoftext|>, <|end|>, <|user|>, <|assistant|>, <|system|>)
  3. Validation: logits were compared on 4 test prompts covering chat, system prompts, and raw completion — all top-5 decoded tokens match exactly, with cosine similarity ≥ 0.99999994

Since this is a custom architecture, loading requires trust_remote_code=True.
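If you want to repeat a similar check yourself, a rough sketch is shown below. It assumes you have last-token logits from the original checkpoint saved to disk; the filename reference_logits.pt is hypothetical.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lewtun/talkie-1930-13b-it-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype="bfloat16"
).to("cuda")

prompt = "Write an essay predicting what life will be like in the year 1960."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits[0, -1].float()

# Hypothetical file holding the last-token logits produced by the original checkpoint.
reference = torch.load("reference_logits.pt").float().to(logits.device)

cos = torch.nn.functional.cosine_similarity(logits, reference, dim=0)
same_top5 = torch.equal(logits.topk(5).indices, reference.topk(5).indices)
print(f"cosine similarity: {cos.item():.8f}, top-5 tokens match: {same_top5}")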

License

Apache 2.0 — same as the original model.

Author: lewtun

Likes: 5

Downloads: 0

Tags: transformers, safetensors, talkie, text-generation, vintage, historical, conversational, custom_code, en, base_model:talkie-lm/talkie-1930-13b-it, base_model:finetune:talkie-lm/talkie-1930-13b-it, license:apache-2.0, region:us

inclusionAI/Ling-2.6-flash-fp8


license: mit
language:
  • en

Ling-2.6-flash: Faster Responses, Stronger Execution, Higher Token Efficiency

Introduction

Today, we announce the official open-source release of Ling-2.6-flash, an instruct model with 104B total parameters and 7.4B active parameters.

As agent capabilities mature, skyrocketing token consumption has become a primary barrier to deployment. Unlike standard chat, agent workflows involve massive inputs and complex, multi-step execution, driving up both compute demand and user costs. While the industry is pivoting toward "long-reasoning" to push performance ceilings, a critical question remains: Are these excessive reasoning tokens truly necessary for high-frequency, everyday agent use cases?

Faced with mounting token pressure, Ling-2.6-flash takes a different path. Rather than relying on longer outputs to chase higher scores, it is systematically optimized for inference efficiency, token efficiency, and agent performance—aiming to stay highly competitive while being faster, leaner, and better suited for real production workloads.

At a high level, Ling-2.6-flash is built around three core strengths:

  • Hybrid linear architecture for higher inference efficiency.
    By introducing a hybrid linear architecture, we improve computational efficiency at the foundation level. On a 4× H20 setup, Ling-2.6-flash reaches inference speeds of up to 340 tokens/s. In other words, it completes tasks with significantly better cost-performance efficiency.
  • Token-efficiency optimization for a better intelligence-efficiency tradeoff.
    During training, we specifically optimized for token efficiency, with the goal of accomplishing tasks using more concise outputs. On the full Artificial Analysis evaluation suite, Ling-2.6-flash uses only 15M tokens while still delivering competitive performance. This translates into a meaningfully stronger intelligence-efficiency profile.
  • Targeted improvements for agent scenarios.
    For the agent use cases seeing the strongest demand today, we continuously refined Ling-2.6-flash in tool use, multi-step planning, and task execution. As a result, the model achieves performance that is competitive with, and in some cases reaches SOTA level against, models with larger active parameter counts on benchmarks including BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, and PinchBench.

Evaluation

We have conducted a comprehensive evaluation of Ling-2.6-flash across multiple authoritative benchmarks. Ling-2.6-flash performs strongly on representative agent benchmarks such as BFCL-V4, TAU2-bench, SWE-bench Verified, and PinchBench. In practice, Ling-2.6-flash delivers a strong user experience across frameworks including Claude Code, Kilo Code, Qwen Code, Hermes Agent, and OpenClaw.

Beyond agent tasks, Ling-2.6-flash also delivers strong performance across general knowledge, mathematical reasoning, instruction following, and long-context understanding, and remains well aligned with SOTA models in the same size class.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/KhFxSrxyF5IAAAAAgCAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="u4a7a4034" class="ne-image"> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/4bI1SK8pNM8AAAAAgBAAAAgADryCAQFr/original" width="8001" title="" crop="0,0,1,1" id="uc95688f2" class="ne-image"> </div>
  • <font style="color:rgb(38, 38, 38);">PinchBench</font><font style="color:rgb(38, 38, 38);">: Comparative scores are retrieved directly from the official PinchBench leaderboard (as of April 20, 2026), adhering to their evaluation modes (potentially Reasoning Mode). </font>
  • <font style="color:rgb(38, 38, 38);">Claw-Eval</font><font style="color:rgb(38, 38, 38);">: Comparative scores are sourced from the official Claw-Eval leaderboard (version dated 2026-03-25), adhering to their evaluation modes (potentially Reasoning Mode). Official scores for GPT-OSS-120B and GPT-5.4-mini are currently unavailable and have been omitted.</font>
  • <font style="color:rgb(38, 38, 38);">TAU2-Bench</font><font style="color:rgb(38, 38, 38);">: Evaluations are conducted using official v1.0.0 code and datasets. Following the GLM-5 evaluation protocol, we applied minor prompt adjustments in the Retail and Telecom domains to ensure users express requests clearly and to prevent premature session termination. Additionally, GPT-5.2 was utilized as the User Agent across all evaluated domains.</font>
  • <font style="color:rgb(38, 38, 38);">IFBench</font><font style="color:rgb(38, 38, 38);">: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA (Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.</font>

Quantization Robustness: FP8 and INT4

We evaluate the FP8 and INT4 quantized models on several datasets. FP8 quantization is applied blockwise, while INT4 quantization is applied groupwise.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/8QEoRqtZhAcAAAAAQaAAAAgADryCAQFr/original" width="800" title="" crop="0,0,1,1" id="uc95688f2" class="ne-image"> </div>

Architecture

Ling-2.6-flash continues the architectural direction introduced in Ling 2.5. Building on the Ling 2.0 foundation, we incorporate a hybrid linear attention mechanism, upgrading the original GQA attention design into a 1:7 MLA + Lightning Linear hybrid architecture through incremental training.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/dZ9VS4RPjzAAAAAAgBAAAAgADryCAQFr/fmt.webp" width="650" title="" crop="0,0,1,1" id="u46a87a11" class="ne-image"> </div>

This combination of hybrid attention and a highly sparse MoE architecture gives Ling-2.6-flash a clear advantage in inference efficiency. Compared with mainstream SOTA models in a similar size class, Ling-2.6-flash not only delivers faster time-to-first-token, but also achieves substantially higher generation throughput in long-output scenarios. At peak, both prefill throughput and decode throughput can improve by up to around 4×.

As shown in the figure below, Ling-2.6-flash’s throughput advantage becomes more pronounced as both context length and generation length increase. More importantly, this is not just a benchmark-side gain on static metrics. In real deployment settings, the model continues to unlock stronger speed benefits as task complexity grows.

Whether the workload involves long-context understanding or extended text generation, Ling-2.6-flash preserves model capability while delivering faster responses, higher throughput, and better real-world deployment efficiency.

<div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/Fa_fQrVD3hcAAAAAX7AAAAgADryCAQFr/original" width="600" alt="Decode Throughput Comparison"> <p><em>Decode Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div> <div align="center"> <img src="https://mdn.alipayobjects.com/huamei_3p6pd0/afts/img/LRDBTILYEooAAAAAXdAAAAgADryCAQFr/original" width="600" alt="Prefill Throughput Comparison"> <p><em>Prefill Throughput Comparison, 4× H20-3e, TP=4, Batch Size = 32</em></p> </div>

Quickstart

SGLang (Recommended)

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

# uv pip "sglang-kernel>=0.4.1"
uv pip install "sglang[all]>=0.5.10.post1" --prerelease=allow
Run Inference

SGLang now supports both the BF16 and FP8 models; which one runs depends on the dtype of the model in ${MODEL_PATH}. Here is an example of running Ling-2.6-flash on 4 GPUs, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

Server

1. Standard Inference (Without MTP)

python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 4 \
    --pp-size 1 \
    --dp-size 1 \
    --trust-remote-code \
    --context-length 262144 \
    --tool-call-parser qwen25 \
    --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}' \
    --dist-init-addr $MASTER_IP:2345 \
    --port $PORT \
    --nnodes 1

2. Inference with MTP (Multi-Token Prediction)
The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly.

Install our SGLang

git clone -b ling_2_6 git@github.com:antgroup/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python"

Start server

python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 4 \
    --pp-size 1 \
    --dp-size 1 \
    --context-length 262144 \
    --mamba-scheduler-strategy extra_buffer \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --max-running-requests 64 \
    --max-mamba-cache-size 256 \
    --tool-call-parser qwen25 \
    --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "rope_theta": 6000000, "partial_rotary_factor": 0.5, "original_max_position_embeddings": 131072}}' \
    --trust-remote-code \
    --dist-init-addr $MASTER_IP:2345 \
    --port $PORT \
    --nnodes 1

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

vLLM

Environment Preparation
pip install uv

uv venv ~/my_ling_env

source ~/my_ling_env/bin/activate

git clone https://github.com/vllm-project/vllm.git

cd vllm

VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto

Run inference

Server

vllm serve $MODEL_PATH \
    --port $PORT \
    --served-model-name my_model \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Limitations & Future Plans

Ling-2.6-flash has already made meaningful progress in our pursuit of an extreme intelligence-efficiency tradeoff. The model has improved substantially in key areas such as tool use, multi-step planning, and long-horizon task execution. Combined with systematic optimizations in inference efficiency and interaction experience, Ling-2.6-flash is now better equipped to handle large-scale, high-frequency automated workloads, delivering stronger real-world value in production settings.

At the same time, we are fully aware that pushing intelligence efficiency to the limit comes with tradeoffs. In some highly complex scenarios, the model can still exhibit tool hallucinations due to limited reasoning depth. In addition, there is still room for improvement in areas such as natural bilingual switching between Chinese and English and compliance with highly complex instructions.

Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between output quality and token efficiency, and to continuously strengthen the model’s stability, usability, and interaction experience across a wider range of real-world scenarios.

Author: inclusionAI

Likes: 4

Downloads: 0

Tags: safetensors, bailing_hybrid, custom_code, en, license:mit, fp8, region:us

Abiray/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF


base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
library_name: gguf
license: other
tags:
  • nemotron
  • moe
  • mamba2
  • reasoning
  • gguf
  • llama-cpp

Nemotron-3-Nano-Omni-30B-A3B-Reasoning - GGUF

This repository contains high-fidelity GGUF quantizations of NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16.

These quantizations were produced explicitly to maximize logic, reasoning, and narrative consistency for local deployments, making them well suited for text-based RPG engines and structured JSON output.

🧠 High-Fidelity Quantization Strategy: FP16 Output Head

Unlike standard GGUF conversions, these models were quantized using the --leave-output-tensor flag.

What does this mean? The final projection layer (lm_head / output.weight), which maps the model's internal states to the vocabulary, has been preserved in pristine FP16 precision (~2.1 GB). While this slightly increases the overall file size and initial HDD load time, it completely eliminates the "numerical noise" introduced when crushing the output head to 4-bit or 5-bit.

The Result: Smaller quantizations (like Q3_K_M or Q4_K_M) retain the sharp logic, precise tool-calling, and chain-of-thought (<think>) capabilities of the massive uncompressed model, all while fitting comfortably into local RAM constraints.
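If you want to confirm this on a downloaded file, one option is the gguf Python package, which can list each tensor's quantization type. A rough sketch is below; the local filename is a placeholder, and attribute names may vary slightly across gguf package versions.

from gguf import GGUFReader

# Placeholder filename for a locally downloaded quantization from this repository.
reader = GGUFReader("Nemotron-3-Nano-Omni-30B-A3B-Reasoning.Q4_K_M.gguf")

for tensor in reader.tensors:
    # The output head should report F16 while most other tensors are Q4_K / Q6_K etc.
    if "output" in tensor.name:
        print(tensor.name, tensor.tensor_type.name)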

⚙️ Model Architecture & Hardware Requirements

  • Architecture: Mamba2-Transformer Hybrid Mixture of Experts (MoE)
  • Parameters: 30 Billion total
  • Active Parameters: ~3 Billion per token (A3B)
  • Context Length: Up to 256k tokens

Because this is a sparse MoE model, it requires significantly less RAM bandwidth and compute power than a dense 30B model. A Q4_K_M variant will easily run on machines with 8GB to 16GB of system RAM. The primary bottleneck will be the initial model loading time.

📂 Available Quantizations

| Quantization | Bits / Weight | Use Case / Notes |
|:---|:---:|:---|
| Q8_0 | 8.5 | Extreme fidelity. Best if you have high RAM but limited VRAM. |
| Q6_K | 6.5 | Excellent balance for 16GB+ systems. Near-perfect F16 parity. |
| Q5_K_M | 5.5 | High quality, slightly faster inference than Q6. |
| Q4_K_M | 4.8 | [RECOMMENDED] The sweet spot for performance vs. intelligence. FP16 head ensures reasoning stays intact. |
| Q4_K_S | 4.5 | Slightly smaller than K_M, minimal quality loss. |
| Q3_K_M | 3.5 | Maximum compression. Great for severely resource-constrained setups (8GB RAM). |

💬 Prompt Format (Reasoning Mode)

This model is trained to utilize a chain-of-thought reasoning budget. It natively supports <think> tags before generating its final response.

Chat Template Example:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
<think>
1. The user is asking for a simple arithmetic operation.
2. The operation is addition: 2 + 2.
3. The result of 2 + 2 is 4.
</think>
The answer is 4.<|im_end|>
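For local use from Python, the llama-cpp-python bindings can apply this template through the chat-completion API, which uses the GGUF's embedded chat template when one is available. A minimal sketch follows, assuming a llama.cpp build recent enough to support this hybrid Mamba2/MoE architecture; the filename is a placeholder.

from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron-3-Nano-Omni-30B-A3B-Reasoning.Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    max_tokens=256,
)
# The reply may begin with a <think> ... </think> reasoning block before the final answer.
print(out["choices"][0]["message"]["content"])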

Author: Abiray

Likes: 3

Downloads: 0

Tags: gguf, nemotron, moe, mamba2, reasoning, llama-cpp, base_model:nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16, base_model:quantized:nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16, license:other, endpoints_compatible, region:us, conversational

mlx-community/Laguna-XS.2-4bit


library_name: mlx
inference: false
extra_gated_description: To learn more about how we process your personal data, please read our <a href="https://poolside.ai/legal/privacy">Privacy Policy</a>.
tags:
  • laguna-xs.2
  • mlx
license: apache-2.0
pipeline_tag: text-generation
base_model: poolside/Laguna-XS.2

mlx-community/Laguna-XS.2-4bit

This model mlx-community/Laguna-XS.2-4bit was converted to MLX format from poolside/Laguna-XS.2 using mlx-lm version 0.31.3.

Use with mlx

pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Laguna-XS.2-4bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: mlx-community

Likes: 3

Downloads: 0

Tags: mlx, safetensors, laguna, laguna-xs.2, text-generation, conversational, custom_code, base_model:poolside/Laguna-XS.2, base_model:quantized:poolside/Laguna-XS.2, license:apache-2.0, 4-bit, region:us

oumoumad/LTX-2.3-22b-IC-LoRA-Deinterlace

LTX-2.3 22B IC-LoRA Deinterlace

This is a Deinterlace IC-LoRA trained on top of LTX-2.3-22b, designed to remove temporal artifacts found in poorly converted or transcoded footage: interlace combing (true interlaced material left progressive, or shown with the wrong field order), bad linear/bilinear deinterlacing, and frame-blending ghosting from bad FPS conversion (frame averaging, mistimed interpolation). The model takes the corrupted clip as the conditioning input and tries to reconstruct a clean progressive version.

It is based on the LTX-2.3 foundation model.

⚠️ Status: not yet well tested. Treat this checkpoint as a preview — quality on real-world degraded footage hasn't been thoroughly validated yet. Feedback welcome.

Model Files

ltx-2.3-22b-ic-lora-deinterlace.safetensors

Model Details

  • Base Model: LTX-2.3-22b
  • Training Type: IC LoRA
  • Purpose: Remove interlace combing, wrong-field-order judder, bad-deinterlace softening, and frame-blending ghosting from input video
  • Training Steps: 5000

🔌 Using in ComfyUI

  1. Copy the LoRA weights into models/loras.
  2. Use the IC-LoRA workflow from the LTX-2 ComfyUI repository.
  3. Load the LoRA using the LTXICLoRALoaderModelOnly node.

License

See the LTX-2-community-license for full terms.

Author: oumoumad

Likes: 3

Downloads: 0

Tags: region:us

cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF


base_model: Qwen/Qwen3.6-27B
language:
  • en
license: apache-2.0
tags:
  • text-generation-inference
  • transformers
  • qwen
  • gguf
  • llama.cpp
  • quantized
pipeline_tag: text-generation

Qwen3.6-27B-i1-IQ4_XS (Fully Optimized)

Motivation

Recent updates in the llama.cpp repository (specifically commit 1dab5f5a44) introduced a hardcoded minimum quantization of q5_K for attn_qkv layers. While this was likely intended to preserve model quality, it causes a noticeable bloat in the final file sizes.

For comparison, the highly efficient Qwen3.5-27B iq4_xs by mradermacher weighed in at 14.7GB, whereas the equivalent Qwen3.6 i1-GGUF under the new commit rules swelled to over 15.1GB.

Methodology

To restore the optimal balance of size and performance, I modified the llama.cpp source code to revert the quantization of attn_qkv layers back to a pure IQ4_XS format. This mirrors the exact 1:1 layer quantization strategy originally used in mradermacher's Qwen3.5-27B release.

This model was quantized utilizing the imatrix provided by mradermacher: Qwen3.6-27B-i1-GGUF.

Performance vs. Size Trade-off

Extensive perplexity testing (llama-perplexity with pg19.txt, 65k context, Q8_0 cache) confirms that forcing pure IQ4_XS across all layers results in a statistically insignificant intelligence drop (+0.0039 PPL) while noticeably reducing the memory footprint.

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128

🧠 Intelligence (Perplexity) Comparison

| Model Version | Perplexity (PPL) | Difference / Quality Drop |
| :--- | :--- | :--- |
| Standard IQ4_XS (with q5_K attn_qkv) | 7.3765 ± 0.02760 | Baseline |
| Custom IQ4_XS (pure / fully iq4) | 7.3804 ± 0.02762 | + 0.0039 (Negligible) |

Conclusion: By utilizing this custom build, users save 375 MiB of active memory and reduce the static file size closer to the 14.7GB mark, with a practically non-existent impact on output quality (~0.05% PPL variance).
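As a quick back-of-the-envelope check that the gap is within measurement noise, treating the quoted ± values as standard errors:

import math

ppl_std, se_std = 7.3765, 0.02760    # standard IQ4_XS (q5_K attn_qkv)
ppl_pure, se_pure = 7.3804, 0.02762  # custom pure IQ4_XS

diff = ppl_pure - ppl_std
se_diff = math.sqrt(se_std**2 + se_pure**2)
print(f"difference: {diff:.4f} PPL ({100 * diff / ppl_std:.3f}% of baseline)")
print(f"difference / combined standard error: {diff / se_diff:.2f}")  # ~0.1, well within noise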

Author: cHunter789

Likes: 3

Downloads: 0

Tags: transformers, gguf, text-generation-inference, qwen, llama.cpp, quantized, text-generation, en, base_model:Qwen/Qwen3.6-27B, base_model:quantized:Qwen/Qwen3.6-27B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational