---
license: mit
tags:
- glm
- moe
- reap
- nvfp4
- sglang
- blackwell
library_name: sglang
base_model:
- zai-org/GLM-5.1
pipeline_tag: text-generation
---

# GLM-5.1-478B-A42B-REAP-NVFP4
NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.
| Spec | Value |
|---|---|
| Total params | 478.4B |
| Activated / token | 42.7B |
| Routed experts / MoE layer | 160 (was 256 in base) |
| Active experts / token | 8 routed + 1 shared |
| Layers | 78 (3 dense + 75 MoE) + 1 MTP / NEXTN |
| Hidden size | 6144 |
| Attention | MLA-DSA, 64 heads |
| Max position | 202,752 |
| Quantization | NVFP4, group_size=16 (modelopt_fp4) |
| On-disk size | 285 GB (85 shards) |
| License | MIT (inherited from GLM-5.1) |
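The on-disk size is roughly what the NVFP4 layout predicts. A back-of-the-envelope sketch, assuming the standard NVFP4 format (E2M1 4-bit weights plus one fp8 E4M3 scale per 16-element block); the gap to 285 GB is the tensors kept in higher precision (embeddings, norms, routers) plus global scales:

```python
# NVFP4: 4-bit weights + 8-bit scale per 16-element block
# -> effective 4 + 8/16 = 4.5 bits per quantized parameter.
params = 478.4e9
bits_per_param = 4 + 8 / 16

quantized_gb = params * bits_per_param / 8 / 1e9
print(f"{quantized_gb:.0f} GB")  # 269 GB

# The ~285 GB on-disk figure adds higher-precision tensors
# (embeddings, norms, router weights) and global scales on top.
```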
## Measured performance
Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig:
| Context | tok/s |
|---|---|
| 256 | 46.5 |
| 4 k | 41.8 |
| 16 k | 38.6 |
| 150 k | 22.4 |
Under live mixed traffic (1,495 decode samples):
| Context range | p50 tok/s |
|---|---|
| < 1 k | 42.7 |
| 1 – 8 k | 44.3 |
| 8 – 32 k | 36.3 |
| 32 – 100 k | 27.7 |
Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.
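These per-rank numbers sum up close to the 96 GB card capacity. A quick arithmetic check (the ~5 GB figure is the reported headroom, not a measured allocation; the small remainder covers runtime allocations):

```python
# Per-rank VRAM budget at 202,752-token context, from the figures above.
weights_gb = 77.2
kv_pool_gb = 11.3       # holds ~270k tokens at fp8_e4m3
cuda_graphs_gb = 0.3
free_gb = 5.0           # approximate reported headroom

used_gb = weights_gb + kv_pool_gb + cuda_graphs_gb
print(f"{used_gb:.1f} GB used, {used_gb + free_gb:.1f} GB accounted for per 96 GB rank")
```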
## Quick start (4× 96 GB Blackwell)

```bash
# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4

# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)

# 4. Launch
./launch.sh   # see full script below
```
## Reference rig
- 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
- NVIDIA driver 580.126.18, CUDA 12.9 userspace
- Ubuntu / Pop!_OS 22.04, Python 3.12
This is the configuration the tuning targets. The same recipe also works on 4× B200 (sm_100), on 8× Hopper (sm_90) with more aggressive quantization, and on other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.
## Exact versions (pinned from the running venv)

Everything below is reproducible from:

```bash
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3
```

The resolver pulls in the whole stack at these versions:
```text
sglang                   0.5.10.post1
torch                    2.9.1+cu129
triton                   3.5.1
transformers             5.3.0
tokenizers               0.22.2
safetensors              0.8.0rc0
numpy                    2.4.4
flashinfer-python        0.6.7.post3
flashinfer-cubin         0.6.7.post3
nvidia-cutlass-dsl       4.5.0.dev0
nvidia-cublas-cu12       12.9.1.4
nvidia-cudnn-cu12        9.10.2.21
nvidia-nccl-cu12         2.27.5
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.20
```
Verify:

```bash
python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3
```
## Required sglang patch (SM120 only)

GLM-5.1's config advertises `GlmMoeDsaForCausalLM`, which sglang routes through DeepSeek Sparse Attention (NSA) by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5.1 through the stable dense-MLA path by excluding it from the NSA architecture list.

Edit `<venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py`, function `is_deepseek_nsa()`:
```python
def is_deepseek_nsa(config) -> bool:
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    # Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )
```
(Only the `architectures` list changes — `GlmMoeDsaForCausalLM` is removed.)

After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: `attention_backend='triton'`.
On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.
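The patched predicate can be sanity-checked in isolation before launching. A minimal sketch with stub config dicts (the `index_topk` value is illustrative):

```python
def is_deepseek_nsa(config) -> bool:
    # Patched version: GlmMoeDsaForCausalLM removed from the list.
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )

# GLM-5.1 now routes to dense MLA even though its config ships index_topk:
print(is_deepseek_nsa({"architectures": ["GlmMoeDsaForCausalLM"], "index_topk": 2048}))  # False
# DeepSeek-V3.2 still takes the NSA path:
print(is_deepseek_nsa({"architectures": ["DeepseekV32ForCausalLM"], "index_topk": 2048}))  # True
```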
## Launch

```bash
#!/usr/bin/env bash
set -euo pipefail

MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv

# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3   # four Blackwell GPUs

# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1

# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1

exec "$VENV/bin/python" -m sglang.launch_server \
  --model-path "$MODEL" \
  --served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --context-length 202752 \
  --max-running-requests 1 \
  --mem-fraction-static 0.94 \
  --chunked-prefill-size 4096 \
  --page-size 128 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --triton-attention-num-kv-splits 64 \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 4 \
  --pre-warm-nccl \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --chat-template "$MODEL/chat_template.jinja" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --watchdog-timeout 1800
```
On sm_90 / sm_100 you will want `--attention-backend flashinfer` and `--fp4-gemm-backend b12x` instead — see the 555B sibling card for that recipe.
## Key flag decisions (why these specific values)

Each of these values was chosen by measurement on the reference rig, not inherited from defaults.
- `--triton-attention-num-kv-splits 64` — the biggest single win. The default is 8. At bs=1 decode on sm_120, raising the kv-split count gave:
| Context | splits=8 | splits=64 |
|---|---|---|
| 4 k | 39.7 | 41.8 |
| 16 k | 26.4 | 38.6 |
| 150 k | 5.2 | 22.4 |
Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.
- `--mem-fraction-static 0.94` — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and leaves ~5 GB/rank of headroom for graph recapture and prefill scratch.
- `--kv-cache-dtype fp8_e4m3` — halves KV memory vs bf16. Required to fit 202k context in the budget.
- `--attention-backend` is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics that are not available on the workstation board.
- `--page-size 128` — the non-MTP default. Drop to 64 only when enabling speculative decode.
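The fp8 requirement can be sanity-checked from the measured pool figures above. A back-of-the-envelope sketch — per-token bytes are derived from the reported 11.3 GB pool holding ~270k tokens per rank, not from the model config:

```python
# Per-rank KV pool measured at launch with fp8_e4m3.
pool_bytes = 11.3e9
fp8_capacity_tokens = 270_000

bytes_per_token_fp8 = pool_bytes / fp8_capacity_tokens
print(f"{bytes_per_token_fp8 / 1024:.1f} KiB/token at fp8")

# bf16 KV doubles per-token bytes, halving capacity in the same pool:
bf16_capacity_tokens = fp8_capacity_tokens // 2

# fp8 clears the 202,752-token context target; bf16 does not.
assert fp8_capacity_tokens > 202_752 > bf16_capacity_tokens
```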
## MTP / NEXTN speculative decode (optional)
The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.
| | Without MTP (this page's default) | With MTP |
|---|---|---|
| Decode tok/s (short) | ~46 | ~90 (1.93×) |
| Max context | 202,752 | ~65,536 |
| KV dtype | fp8_e4m3 | bf16 (required by NEXTN) |
| Page size | 128 | 64 (required by NEXTN) |
MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable it with:

```bash
# Replace three lines in the launch script:
  --context-length 65536 \
  --page-size 64 \
  --kv-cache-dtype auto \
# and add:
  --speculative-algorithm NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-attention-mode decode \
  --speculative-moe-runner-backend cutlass
```
Also drop `--mem-fraction-static` to 0.88 — the draft worker adds ~5 GB/rank.
## Sampling recommendations

General chat / reasoning:

```
temperature=0.5 top_p=0.95 frequency_penalty=0.3 repetition_penalty=1.05
```

Strict-answer (MCQ, tool-use benchmarks):

```
temperature=0.0 repetition_penalty=1.05
```

Keep `repetition_penalty=1.05` everywhere. Pure greedy decoding with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).
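Against the OpenAI-compatible endpoint the launch script exposes on port 8000, the general-chat settings map onto a request like the following. This is a sketch: the prompt is illustrative, the URL assumes the launch script above, and `repetition_penalty` is passed as an sglang extension to the OpenAI schema (verify against your sglang version):

```python
import json

# Recommended general-chat sampling settings as a
# /v1/chat/completions payload for the server on localhost:8000.
payload = {
    "model": "GLM-5.1-478B-A42B-REAP-NVFP4",
    "messages": [{"role": "user", "content": "Summarize the REAP pruning idea."}],
    "temperature": 0.5,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "repetition_penalty": 1.05,  # sglang-specific sampling extension
    "max_tokens": 512,
}

# Send with e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```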
## Lineage & license

```text
zai-org/GLM-5.1 (official, 744B bf16, 256 experts, MIT)
│
├── community NVFP4 quantization via NVIDIA Model Optimizer
│     (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
│
├── Local REAP pass 1: 256 → 192 experts
│     0xSero/GLM-5.1-555B-A14B-REAP-NVFP4
│
└── Local REAP pass 2: 192 → 160 experts
      0xSero/GLM-5.1-478B-A42B-REAP-NVFP4   ← this model
```
Both REAP passes were done locally using pooled token-weighted observations from:
Prune scripts and MTP-stitch script are in the repo tree.
License: MIT, inherited from zai-org/GLM-5.1.
Citation (REAP method):

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
<!-- GLM51_FAMILY_COMPAT_START -->
## GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (arXiv:2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.
### Quick picker
| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |
### Full family
| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang --quantization modelopt_fp4 | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang --quantization modelopt_fp4 | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14B | 4× 80 GB | vllm / sglang --quantization gptq_marlin | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
### Notes

- **NVFP4 on Hopper (H100/H200):** supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
- **NVFP4 on B200 / Blackwell datacenter (sm_100):** use flashinfer attention + `b12x` or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
- **NVFP4 on Blackwell Workstation (sm_120):** use `--attention-backend triton` (not flashinfer — PCIe P2P atomics are unavailable on the workstation board), `--moe-runner-backend cutlass`, and `--fp4-gemm-backend flashinfer_cudnn`. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
- **GPTQ-W4A16 vs NVFP4:** same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
- **REAP expert-count variants (555B/444B):** different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for ~20% less VRAM.
- **Why 478B-A42B-REAP-NVFP4 is different:** it is double-pruned (256 → 192 → 160 experts) and optimized for a specific Blackwell Workstation 4× 96 GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.
### Pointer to active inference recipe
See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.
### Citation

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
<!-- GLM51_FAMILY_COMPAT_END -->