Today's AI Summary

AI Developments: Token Compression, Open Language Models, and Agentic Workflows Emerge

Here's a look at some of the most interesting AI developments from today, focusing on new models and research papers.

Research Highlights

Several compelling research papers have been published, addressing key challenges and opportunities in the field:

  • Black-Box On-Policy Distillation: The paper "Black-Box On-Policy Distillation of Large Language Models" introduces Generative Adversarial Distillation (GAD), a novel approach for training student LLMs by learning from a teacher model's text outputs without accessing internal data. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation.
  • Fully Open Language Models: The paper "Instella: Fully Open Language Models with Stellar Performance" introduces Instella, a family of fully open three-billion-parameter language models trained entirely on openly available data with a fully open codebase. Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size.
  • Agentic Workflow for Internet Measurement Research: The paper "Towards an Agentic Workflow for Internet Measurement Research" presents ArachNet, the first system demonstrating that LLM agents can independently generate measurement workflows that mimic expert reasoning. ArachNet operates through four specialized agents that mirror the expert workflow, from problem decomposition to solution implementation.
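As a toy illustration of GAD's minimax structure (not the paper's implementation), here is a minimal numpy sketch in which the "teacher" and "student" are categorical token distributions, the discriminator is a per-token logistic score, and the student is updated on-policy via REINFORCE with the discriminator's score as reward. All shapes and hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                                   # toy vocabulary size
p_teacher = rng.dirichlet(np.ones(V))   # fixed "teacher" token distribution

theta = np.zeros(V)   # student logits (the generator)
w = np.zeros(V)       # per-token discriminator logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    p_student = softmax(theta)
    # Discriminator step: push D toward 1 on teacher tokens, 0 on student tokens
    t = rng.choice(V, p=p_teacher)
    s = rng.choice(V, p=p_student)
    w[t] += 0.1 * (1.0 - sigmoid(w[t]))   # ascend grad of log D(t)
    w[s] += 0.1 * (0.0 - sigmoid(w[s]))   # ascend grad of log(1 - D(s))
    # Generator step: REINFORCE, discriminator score as on-policy reward
    reward = np.log(sigmoid(w[s]) + 1e-8)
    baseline = np.log(0.5)                # constant baseline for variance reduction
    grad_logp = -p_student
    grad_logp[s] += 1.0                   # grad of log softmax at sampled token
    theta += 0.05 * (reward - baseline) * grad_logp

p_student = softmax(theta)
kl = np.sum(p_teacher * np.log(p_teacher / (p_student + 1e-12)))
```

Because the discriminator's score co-evolves with the student, the reward signal adapts as the student improves, which is the stability property the paper highlights.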

Model Spotlight

Here are some of the models that have been released:

  • infgrad/Jasper-Token-Compression-600M: This sentence transformer model introduces dynamic text token compression, achieving a 3x compression rate while maintaining performance. It combines vector distillation with contrastive learning and supports bilingual (Chinese and English) applications. The model has received 4 likes.
  • aquif-ai/aquif-DeepResearch-30B-A3B: This model is a specialized open-weight research agent with a 1 million token context window. It outperforms larger models in agentic tasks and achieves state-of-the-art results in a small package. The model has received 2 likes.
  • Surface-ai/r19372: This code generation model is designed for production-grade software development, featuring advanced memory and multi-language support. It excels at code generation, refactoring, bug detection, and code completion. The model has received 2 likes.
  • Meeteshn/vit_fruit_ripeness_classifier: This model classifies the ripeness of fruits using Vision Transformer (ViT) feature extraction and Logistic Regression. It supports apples, bananas, and oranges, classifying them as fresh, unripe, or rotten. The model has received 2 likes.
  • issai/Qolda: This vision-language model is designed to operate in Kazakh, Russian, and English. Built on top of InternVL3.5 and Qwen3, it demonstrates significant performance improvements for Kazakh while maintaining comparable performance on Russian and English. The model has received 2 likes.
  • Clemylia/Tiny-lamina: This ultra-compact language model is designed for creative text generation, producing random and humorous sequences of words. It is optimized for users without GPUs and is ideal for educational purposes. The model has received 2 likes.
  • Intel/MiniMax-M2-gguf-q2ks-mixed-AutoRound: This model is a mixed gguf q2ks format of MiniMaxAI/MiniMax-M2 generated by intel/auto-round algorithm. The model has received 2 likes.

Key Takeaways

  • Token Compression: Models like Jasper-Token-Compression-600M are pushing the boundaries of efficient text representation, enabling faster processing and reduced memory footprint.
  • Open and Accessible LLMs: The release of Instella highlights the growing trend towards fully open language models, promoting transparency and reproducibility in research.
  • Specialized Agents: Models like aquif-DeepResearch-30B-A3B and the research presented in "Towards an Agentic Workflow for Internet Measurement Research" demonstrate the potential of AI agents to automate complex tasks and workflows.
  • Multilingual Support: Models like issai/Qolda are expanding language capabilities, enabling more inclusive and accessible AI applications.

AI Papers for 2026-02-23

Sink-Aware Pruning for Diffusion Language Models

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention-sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
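The core observation (sink-position variance across timesteps) can be illustrated with a toy numpy sketch; the attention matrices and the "drifting sink" dynamics here are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
T, L = 20, 64   # denoising timesteps, sequence length

def sink_position(attn):
    # "sink" = the key position receiving the most attention mass over all queries
    return int(attn.sum(axis=0).argmax())

# AR-like regime: attention mass concentrated on one fixed anchor token
ar_sinks = []
for _ in range(T):
    attn = rng.random((L, L))
    attn[:, 0] += 5.0                  # stable sink at position 0
    attn /= attn.sum(axis=1, keepdims=True)
    ar_sinks.append(sink_position(attn))

# DLM-like regime: the dominant position drifts across timesteps
dlm_sinks = []
for t in range(T):
    attn = rng.random((L, L))
    attn[:, (3 * t) % L] += 5.0        # transient sink that moves each step
    attn /= attn.sum(axis=1, keepdims=True)
    dlm_sinks.append(sink_position(attn))

ar_var, dlm_var = np.var(ar_sinks), np.var(dlm_sinks)
```

High variance of the dominant sink position over the trajectory is the paper's signal that a sink is transient and therefore a pruning candidate.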

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations across multiple languages and time periods. Systems are asked to classify relations of two types, at ("Has the person ever been at this place?") and isAt ("Is the person located at this place around publication time?"), requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.

MARS: Margin-Aware Reward-Modeling with Self-Refinement

Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods such as PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the ambiguous cases and failure modes of the reward model. MARS concentrates augmentation on low-margin (ambiguous) preference pairs where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing information and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
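A minimal sketch of the margin-aware idea, assuming hypothetical reward scores and an illustrative inverse-margin weighting (the paper's exact sampling rule may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

def reward_margin(r_chosen, r_rejected):
    # margin = how strongly the reward model prefers the chosen response
    return r_chosen - r_rejected

# Illustrative reward-model scores for 10 preference pairs
r_chosen   = rng.normal(1.0, 1.0, size=10)
r_rejected = rng.normal(0.0, 1.0, size=10)
margins = reward_margin(r_chosen, r_rejected)

# Low-margin pairs are the ambiguous ones the reward model is unsure about
tau = 0.5
ambiguous = np.where(np.abs(margins) < tau)[0]

# Concentrate augmentation on low-margin pairs: weight inversely with |margin|
weights = np.exp(-np.abs(margins))
weights /= weights.sum()
augment_idx = rng.choice(len(margins), size=5, replace=True, p=weights)
```

Iterating this (augment hard pairs, retrain, recompute margins) is the "self-refinement" loop in the title.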

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.

FAMOSE: A ReAct Approach to Automated Feature Discovery

Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that leverages the ReAct paradigm to autonomously explore, generate, and refine features while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE represents the first application of an agentic ReAct framework to automated feature engineering, covering both regression and classification tasks. Extensive experiments demonstrate that FAMOSE is at or near the state of the art on classification tasks (especially tasks with more than 10K instances, where ROC-AUC increases by 0.23% on average), and achieves the state of the art for regression tasks by reducing RMSE by 2.0% on average, while remaining more robust to errors than other algorithms. We hypothesize that FAMOSE's strong performance stems from ReAct allowing the LLM context window to record (via iterative feature discovery and evaluation steps) which features did or did not work. This acts like a few-shot prompt and guides the LLM to invent better, more innovative features. Our work offers evidence that AI agents are remarkably effective in solving problems that require highly inventive solutions, such as feature engineering.

Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting

Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.

When to Trust the Cheap Check: Weak and Strong Verification for Reasoning

Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak-strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. At the population level, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
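The two-threshold structure can be sketched directly; the threshold values below are illustrative, not the paper's optimal ones:

```python
def weak_strong_policy(weak_score, accept_thresh=0.9, reject_thresh=0.2):
    """Two-threshold policy: act on the cheap weak-verification score when it
    is confidently high or low, and defer to costly strong verification only
    in the ambiguous middle band."""
    if weak_score >= accept_thresh:
        return "accept"
    if weak_score <= reject_thresh:
        return "reject"
    return "defer"
```

The width of the deferral band trades strong-verification frequency against incorrect acceptance/rejection, and a well-calibrated, sharp weak verifier shrinks the band.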

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony markedly increases the variance of the policy-gradient estimator: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales the learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5x while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
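The ESS-based scaling in (i) can be sketched as follows; the closed-form baseline in (ii) is not shown, and the exact scaling rule here is a plausible reading rather than the paper's formula:

```python
import numpy as np

def effective_sample_size(importance_ratios):
    # Kish ESS: (sum w)^2 / sum w^2, in [1, n]
    w = np.asarray(importance_ratios, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def scaled_lr(base_lr, importance_ratios):
    # Dampen the update when a few heavy-tailed ratios dominate the batch
    n = len(importance_ratios)
    return base_lr * effective_sample_size(importance_ratios) / n

# On-policy regime: ratios near 1 -> ESS near n, learning rate unchanged
uniform = np.ones(64)
# Stale-rollout regime: one huge ratio -> ESS collapses, lr strongly damped
heavy = np.ones(64)
heavy[0] = 100.0
```

With the uniform batch the scaled learning rate equals the base rate, while a single dominating ratio pushes ESS toward 1 and shrinks the step accordingly.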

Towards Anytime-Valid Statistical Watermarking

The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid anytime inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
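The anytime-valid mechanism rests on a standard fact about test supermartingales (Ville's inequality); here is a minimal sketch with illustrative per-token e-values, not the paper's optimal anchored construction:

```python
def anytime_detect(e_values, alpha=0.01):
    """Multiply per-token e-values into a test supermartingale M_t.
    By Ville's inequality, P(sup_t M_t >= 1/alpha) <= alpha under the null
    (human text), so stopping the first time M_t crosses 1/alpha keeps the
    Type-I error valid at ANY stopping time, unlike fixed-horizon tests."""
    m = 1.0
    for t, e in enumerate(e_values, start=1):
        m *= e
        if m >= 1.0 / alpha:
            return t, m      # detected: evidence of the watermark suffices
    return None, m           # never crossed the threshold

# Watermarked text: e-values with mean > 1 under the alternative grow M_t
detected_at, _ = anytime_detect([1.5] * 20, alpha=0.01)
```

Better-designed e-values grow faster in expectation under the alternative, which is exactly the worst-case log-growth-rate optimality the paper targets, and what reduces the average token budget for detection.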

AI Models

0xSero/Kimi-K2.5-PRISM-REAP-72


license: other
license_name: kimi-k2.5
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model:

  • moonshotai/Kimi-K2.5
  • Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B

tags:

  • moe
  • expert-pruning
  • reap
  • deepseek
  • kimi
  • prism
  • int4
  • compressed-tensors
  • consumer-gpu
  • rtx-3090

library_name: transformers
pipeline_tag: text-generation

Kimi-K2.5-PRISM-REAP-72

KEEP TEMPERATURE AT 0

81% REAP expert-pruned version of moonshotai/Kimi-K2.5, further pruned from the PRISM-REAP 192-expert variant. Designed to fit on 8x RTX 3090 (24GB) consumer GPUs.

| Property | Value |
|----------|-------|
| Architecture | KimiK25 (DeepSeekV3 backbone, MLA attention) |
| Total Parameters | ~200B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 72 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Hidden Size | 7168 |
| Attention | MLA (kv_lora_rank=512, q_lora_rank=1536) |
| Quantization | W4A16 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 122 GB (down from 289 GB / 555 GB original) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Vision | Supported (inherited from Kimi-K2.5) |

Why 72 Experts?

72 was chosen because:

  • Divisible by 8: Clean sharding across 8 GPUs for TP/EP
  • ~122 GB total: Fits in 8x 24GB with room for KV cache
  • ~15 GB/GPU weight footprint with Expert Parallelism, leaving ~7 GB for KV cache and overhead
  • Retains the top 72 most salient experts per layer from the original 384
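The bullet points above amount to simple arithmetic, which can be sanity-checked directly (the ~7 GB figure in the card additionally subtracts runtime overhead):

```python
# Sanity-check the sharding arithmetic behind the 72-expert choice
total_disk_gb = 122
n_gpus = 8
gpu_vram_gb = 24

assert 72 % n_gpus == 0                     # clean expert sharding for TP/EP
per_gpu_weights = total_disk_gb / n_gpus    # ~15.25 GB of weights per GPU
headroom = gpu_vram_gb - per_gpu_weights    # ~8.75 GB before runtime overhead
```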

Performance (8x RTX 3090, 155W, vLLM 0.15.1)

| Metric | Value |
|--------|-------|
| Single request | 33.4 tok/s |
| 2 concurrent | 52.5 tok/s |
| 4 concurrent | 86.2 tok/s |
| 8 concurrent | 145.5 tok/s |
| TTFT | 0.08s |
| Max context | 57,344 tokens |
| Vision | Working |

Recommended vLLM Launch (8x RTX 3090)

VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve 0xsero/Kimi-K2.5-PRISM-REAP-72 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 57344 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching
Transformers Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", thinking=False)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)  # greedy decoding, per the temperature-0 guidance above
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Pruning Details

This model was created in a two-stage process:

  1. Stage 1 (Ex0bit): REAP pruning of original 384 experts to 192 experts using saliency scores from 512 calibration samples on allenai/tulu-3-sft-mixture
  2. Stage 2 (this model): Further pruning from 192 to 72 experts using the same REAP saliency scores, targeting consumer GPU deployment

Key Technical Details

  • Per-layer top-72 selection: The 72 most salient experts retained independently per layer
  • Gate weight slicing: Router gate weights [192, 7168] sliced to [72, 7168], e_score_correction_bias from [192] to [72]
  • Contiguous expert remapping: Expert indices remapped to 0-71 in each layer
  • All non-expert weights preserved: Attention (MLA), shared expert, embeddings, and LM head unchanged
  • Saliency ordering verified: In every layer, min(retained_saliency) > max(pruned_saliency), confirming that exactly the top 72 experts were selected
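The slicing and remapping described above can be sketched with numpy stand-ins for the real router tensors; the array names and random values here are illustrative, only the shapes come from the card:

```python
import numpy as np

rng = np.random.default_rng(3)
n_experts, hidden, keep = 192, 7168, 72

gate_weight = rng.standard_normal((n_experts, hidden)).astype(np.float32)
bias = rng.standard_normal(n_experts).astype(np.float32)  # e_score_correction_bias
saliency = rng.random(n_experts)                          # per-expert REAP scores

# Per-layer top-72 selection by saliency; sorting keeps the remap contiguous
retained = np.sort(np.argsort(saliency)[-keep:])

# Slice the router tensors; retained experts are remapped to indices 0..71
gate_weight = gate_weight[retained]   # [192, 7168] -> [72, 7168]
bias = bias[retained]                 # [192] -> [72]

# The saliency-ordering invariant the card verifies
assert saliency[retained].min() > np.delete(saliency, retained).max()
```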

What is REAP?

REAP (Cerebras Research, 2025) is a one-shot expert pruning method for MoE models:

S_j = (1 / |X_j|) * SUM_{x in X_j} [ g_j(x) * ||f_j(x)||_2 ]

Where g_j(x) is the normalized gate weight and ||f_j(x)||_2 is the L2 norm of expert j's output.
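Evaluated on toy data, the saliency score above is straightforward to compute (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n_tokens, d = 256, 16

def reap_saliency(gate, expert_out):
    # S_j = mean over routed tokens x of g_j(x) * ||f_j(x)||_2
    return np.mean(gate * np.linalg.norm(expert_out, axis=1))

# Toy calibration batch routed to a single expert j
gate = rng.random(n_tokens)                        # normalized gate weights g_j(x)
expert_out = rng.standard_normal((n_tokens, d))    # expert outputs f_j(x)
s_j = reap_saliency(gate, expert_out)
```

Experts whose outputs are both strongly gated and large in norm score highest, so pruning the lowest-scoring experts removes those that contribute least to the routed computation.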

What is PRISM?

The base model was treated with the PRISM-LITE pipeline, softening over-refusal and bias behaviors while preserving model quality.

Optimization Notes

  • Expert Parallelism (EP) is critical on PCIe GPUs -- reduces per-GPU model memory from ~23 GB to ~17 GB
  • TRITON_MLA is the only MLA backend available on Ampere (CC 8.6)
  • FP8 KV cache is not supported with MLA on Ampere; MLA's built-in KV compression (kv_lora_rank=512) already provides ~14x efficiency vs standard MHA
  • Uniform GPU power limits prevent synchronization stalls in TP/EP configurations

Citation

@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Acknowledgments

Author: 0xSero

Likes: 3

Downloads: 0

Tags: transformers, safetensors, kimi_k25, feature-extraction, moe, expert-pruning, reap, deepseek, kimi, prism, int4, compressed-tensors, consumer-gpu, rtx-3090, text-generation, conversational, custom_code, arxiv:2510.13999, base_model:Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B, base_model:quantized:Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B, license:other, region:us

ubergarm/Qwen3-Coder-Next-GGUF


quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-Next
base_model_relation: quantized
license: apache-2.0
tags:

  • imatrix
  • conversational
  • qwen3_next
  • ik_llama.cpp

ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-Coder-Next

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds. Also check for ik_llama.cpp Windows builds by Thireus here.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to huggingface for hosting all these big quants!

Finally, I really appreciate the support from aifoundry.org so check out their open source RISC-V based solutions!

Quant Collection

Perplexity computed against wiki.test.raw. (lower is "better")

Perplexity Chart

These two are just test quants for baseline perplexity comparison and not available for download here:

  • BF16 148.502 GiB (16.010 BPW)
    • PPL over 584 chunks for n_ctx=512 = 8.2278 +/- 0.06392
  • Q8_0 78.982 GiB (8.515 BPW)
    • PPL over 584 chunks for n_ctx=512 = 8.2239 +/- 0.06389

NOTE: The first split file is intentionally much smaller because it only contains metadata; it's fine!

IQ4_KSS 39.377 GiB (4.245 BPW)

PPL over 584 chunks for n_ctx=512 = 8.3069 +/- 0.06459

<details> <summary>πŸ‘ˆ Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 60 Repeating Layers [0-59]

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
blk\..*\.ssm_ba\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --dry-run \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/imatrix-Qwen3-Coder-Next-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-512x2.5B-BF16-00001-of-00004.gguf \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-IQ4_KSS.gguf \
    IQ4_KSS \
    128
</details>

smol-IQ2_KS 22.097 GiB (2.382 BPW)

PPL over 584 chunks for n_ctx=512 = 9.4488 +/- 0.07565

<details> <summary>πŸ‘ˆ Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 60 Repeating Layers [0-59]

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
blk\..*\.ssm_ba\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --dry-run \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/imatrix-Qwen3-Coder-Next-BF16.dat \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-512x2.5B-BF16-00001-of-00004.gguf \
    /mnt/data/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-smol-IQ2_KS.gguf \
    IQ2_KS \
    128
</details>

Quick Start

Check some recent model cards for examples on running models.

# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)

# Download Desired Quants
$ pip install huggingface_hub
$ hf download --local-dir ./ --include=smol-IQ2_KS/*.gguf ubergarm/Qwen3-Coder-Next-GGUF

# Full GPU offload
# For 2 or more GPUs keep an eye on `-sm graph` support:
# https://github.com/ikawrakow/ik_llama.cpp/pull/1292
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-server \
  --model "$model" \
  --alias Qwen3-Coder-Next \
  -c 262144 \
  -fa on \
  -ger \
  --merge-qkv \
  -sm graph \
  -ngl 99 \
  -ub 2048 -b 2048 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --no-mmap

# Hybrid CPU+GPU
# basically use --n-cpu-moe etc...
echo TODO

# CPU-Only
# Gated delta net CPU-only performance seems slower than other architectures; ideally have at least 1x GPU for attn/kv-cache
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model"\
    --alias Qwen3-Coder-Next \
    --ctx-size 131072 \
    -ger \
    --merge-qkv \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

References

Author: ubergarm

Likes: 3

Downloads: 0

Tags: gguf, imatrix, conversational, qwen3_next, ik_llama.cpp, text-generation, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us

paperscarecrow/Gemma3MoOLET


license: agpl-3.0
base_model:

  • mlabonne/gemma-3-12b-it-qat-abliterated
  • mlabonne/gemma-3-4b-it-abliterated

Polymath Swarm: Dynamic Mixture-of-Experts via O-TITANS (MoOLE-T)

The Paradigm Shift

The current open-source meta relies on monolithic, massive-parameter models (70B+) to achieve multi-domain competency. This approach is computationally expensive, hardware-restrictive, and prone to catastrophic forgetting during fine-tuning.

The Polymath Swarm introduces the MoOLE-T architecture (Mixture-of-Orthogonal-LoRA-Experts with TITANS routing). Instead of one massive brain, we use a lightweight cognitive router to dynamically hot-swap hyper-specialized "Engrams" onto a mid-sized synthesis core in real-time.

The Architecture

The Brainstem (Cognitive Router): Powered by gemma-3-4b-it. It intercepts the user prompt, utilizes a <think> block to decompose the task, and fires a deterministic routing token (e.g., [ROUTE: code_python]).

The Orchestrator: A localized Python controller that catches the routing token, retrieves the required skill from an engrams.json dictionary, and hot-swaps the physical weights into VRAM in milliseconds.

The Frontal Lobe (Synthesis Core): Powered by gemma-3-12b-it. It acts as the execution engine. It idles in a sterile, baseline state until the Orchestrator mounts a specialized engram to its attention matrices.

The Engrams (O-TITANS Tools): These are not standard LoRAs. They are forged using the Orthogonal-TITANS matrix penalty we published previously [Link to O-TITANS post]. By strictly isolating the fine-tune to the q_proj and v_proj layers and mathematically punishing dimensional overlap, we can inject extreme domain expertise (like advanced Python asyncio networking) without degrading the model's foundational conversational alignment.
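A minimal sketch of the Orchestrator's catch-and-swap loop described above, with an in-memory dict standing in for the engrams.json registry and the actual weight-mounting step omitted; the routing-token format follows the card's example, everything else is assumed:

```python
import re

# In-memory stand-in for the engrams.json skill registry
ENGRAMS = {
    "code_python": "adapters/otitans_code_python.pt",  # the card's first engram
    # community-contributed engrams would be registered here
}

def extract_route(router_output: str):
    """Catch the deterministic routing token emitted by the 4B router,
    e.g. '[ROUTE: code_python]' after its <think> block."""
    m = re.search(r"\[ROUTE:\s*([a-z_]+)\]", router_output)
    return m.group(1) if m else None

def orchestrate(router_output: str):
    skill = extract_route(router_output)
    if skill is None or skill not in ENGRAMS:
        return None           # synthesis core stays in its sterile baseline state
    return ENGRAMS[skill]     # adapter file to hot-swap onto the 12B core

adapter = orchestrate("<think>networking task</think> [ROUTE: code_python]")
```

In the real system the returned path would be loaded as a LoRA adapter onto the q_proj/v_proj matrices of the synthesis core; here the lookup alone illustrates the control flow.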

The Vision: An "App Store" for Cognition

Included in this repository is our first production engram: otitans_code_python.pt.

However, the true goal of the MoOLE-T framework is the creation of a community-driven repository of hot-swappable skills. Users shouldn't have to download a new 20GB model just because they want their AI to analyze medical documents or write cyberpunk fiction. You should be able to download a 25MB .pt file, drop it into your /adapters/ folder, update your engrams.json, and instantly grant your Swarm a new capability.

The Roadmap: "Featherweight" Edge Deployment

While this V1 release utilizes a 4B/12B dynamic, we are actively developing the "Nano" variant. By deep-frying the gemma-3-270m-it into a pure stimulus-response Reflex Arc, we will bring this dynamic Mixture-of-Experts architecture to CPU-only and edge devices.

Links & Assets

O-TITANS Gemma 3 Adapters (Proof of Concept): (https://huggingface.co/paperscarecrow/O-TITANS-Gemma3)
Training Scripts & Surgery Methodology: (https://github.com/PaperScarecrow/O-TITANS)

Credits & Resources

A massive credit to the foundational work that made this possible:

ffurfaro for the TPTT "titanesque" methodologies that inspired the titanized-lora structural approach.
mlabonne for the BF16 Gemma-3-abliteration models. The zeroed vectors from his minosv1 process are what make the underlying synthesis actually work without semantic contamination.
Google for the TITANS research

Author: paperscarecrow

Likes: 2

Downloads: 0

Tags: safetensors, base_model:mlabonne/gemma-3-12b-it-qat-abliterated, base_model:finetune:mlabonne/gemma-3-12b-it-qat-abliterated, license:agpl-3.0, region:us

huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF


license: other
license_name: modified-mit
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
  • moonshotai/Kimi-K2.5
tags:
  • abliterated
  • uncensored
  • GGUF

huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF

This is an uncensored version of moonshotai/Kimi-K2.5 created with abliteration (see remove-refusals-with-transformers for details).

Download and merge

Use the llama.cpp llama-gguf-split tool to merge the split model files (llama-gguf-split must be compiled first):

huggingface-cli download huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF --local-dir ./huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF --token xxx

llama-gguf-split --merge huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF/Q2_K-GGUF/Q2_K-GGUF-00001-of-00041.gguf huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF/Q2_K.gguf

Process image

llama-mtmd-cli -m huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF/Q2_K.gguf --mmproj huihui-ai/Huihui-Kimi-K2.5-BF16-abliterated-GGUF/mmproj-model-f16.gguf -c 40960 --image cars.jpg -p "Describe this image" 

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

If you like it, please click 'like' and follow us for more updates.
You can follow x.com/support_huihui to get the latest model information from huihui.ai.

Your donation helps us continue development and improvement; even a cup of coffee helps.
  • bitcoin(BTC):
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi (https://ko-fi.com/huihuiai)!

Author: huihui-ai

Likes: 2

Downloads: 0

Tags: transformers, gguf, abliterated, uncensored, GGUF, image-text-to-text, base_model:moonshotai/Kimi-K2.5, base_model:quantized:moonshotai/Kimi-K2.5, license:other, endpoints_compatible, region:us, conversational

DarkArtsForge/Magistaroth-24B-v1


license: apache-2.0
base_model:
  • mistralai/Magistral-Small-2509
  • Gryphe/Tiamat-24B-Magistral
  • TheDrummer/Magidonia-24B-v4.3
  • TheDrummer/Precog-24B-v1
  • zerofata/MS3.2-PaintedFantasy-v3-24B
  • zerofata/MS3.2-PaintedFantasy-v4.1-24B
datasets:
  • OccultAI/illuminati_imatrix_v1
language:
  • en
library_name: transformers
tags:
  • DELLA
  • merge
  • mergekit
widget:
  • text: "Magistaroth 24B v1"
    output:
      url: https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/n2QI4o6Xx2d5QUhsLvmt3.png

[!CAUTION] <span style="color:red; font-weight:bold">⚠️ Warning:</span> This model can produce narratives and RP containing violent and graphic erotic content. Adjust your system prompt accordingly, and use the Mistral Tekken chat template.

🌌 Magistaroth 24B v1

Magistaroth

A highly creative merge. Expect some refusals, though you can use jailbreaks or ablate the model. A normalize: true version was also tested; normalize: false did better overall: slightly less censored, more detailed, and more creative.

Scores 14152 at Q0 Bench (Pass Q0G).

This model was merged using the following merge method: <a href="https://arxiv.org/abs/2406.11617">DELLA</a>

architecture: MistralForCausalLM
models:
  - model: B:\24B\!models--mistralai--Magistral-Small-2509\textonly
  - model: B:\24B\!models--Gryphe--Tiamat-24B-Magistral\textonly
    parameters:
      density: 0.9
      weight: 0.4
      epsilon: 0.099
  - model: B:\24B\!models--TheDrummer--Magidonia-24B-v4.3
    parameters:
      density: 0.9
      weight: 0.4
      epsilon: 0.099
  - model: B:\24B\!models--TheDrummer--Precog-24B-v1
    parameters:
      density: 0.9
      weight: 0.4
      epsilon: 0.099
  - model: B:\24B\!models--zerofata--MS3.2-PaintedFantasy-v3-24B
    parameters:
      density: 0.9
      weight: 0.4
      epsilon: 0.099
  - model: B:\24B\!models--zerofata--MS3.2-PaintedFantasy-v4.1-24B
    parameters:
      density: 0.9
      weight: 0.4
      epsilon: 0.099
# Seed: 420
merge_method: della
base_model: B:\24B\!models--mistralai--Magistral-Small-2509\textonly
parameters:
  lambda: 1.0
  normalize: false
  int8_mask: false
dtype: float32
out_dtype: bfloat16
tokenizer:
  source: B:\24B\!models--TheDrummer--Magidonia-24B-v4.3
# chat_template: auto
name: 🌌 Magistaroth-24B-v1

MagiAudit

Author: DarkArtsForge

Likes: 2

Downloads: 0

Tags: transformers, safetensors, mistral, text-generation, DELLA, merge, mergekit, conversational, en, dataset:OccultAI/illuminati_imatrix_v1, arxiv:2406.11617, base_model:Gryphe/Tiamat-24B-Magistral, base_model:merge:Gryphe/Tiamat-24B-Magistral, base_model:TheDrummer/Magidonia-24B-v4.3, base_model:merge:TheDrummer/Magidonia-24B-v4.3, base_model:TheDrummer/Precog-24B-v1, base_model:merge:TheDrummer/Precog-24B-v1, base_model:mistralai/Magistral-Small-2509, base_model:merge:mistralai/Magistral-Small-2509, base_model:zerofata/MS3.2-PaintedFantasy-v3-24B, base_model:merge:zerofata/MS3.2-PaintedFantasy-v3-24B, base_model:zerofata/MS3.2-PaintedFantasy-v4.1-24B, base_model:merge:zerofata/MS3.2-PaintedFantasy-v4.1-24B, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

mradermacher/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking-i1-GGUF


base_model: DavidAU/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking
datasets:
  • TeichAI/claude-4.5-opus-high-reasoning-250x
language:
  • en
library_name: transformers
license: apache-2.0
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • mixture of experts
  • moe
  • 64 experts
  • uncensored
  • unsloth
  • finetune
  • All use cases
  • bfloat16
  • creative
  • creative writing
  • fiction writing
  • plot generation
  • sub-plot generation
  • story generation
  • scene continue
  • storytelling
  • fiction story
  • science fiction
  • romance
  • all genres
  • story
  • writing
  • vivid prosing
  • vivid writing
  • fiction

About


weighted/imatrix quants of https://huggingface.co/DavidAU/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 4.6 | for the desperate |
| GGUF | i1-IQ1_M | 5.1 | mostly desperate |
| GGUF | i1-IQ2_XXS | 5.9 | |
| GGUF | i1-IQ2_XS | 6.6 | |
| GGUF | i1-IQ2_S | 6.6 | |
| GGUF | i1-IQ2_M | 7.3 | |
| GGUF | i1-Q2_K_S | 7.5 | very low quality |
| GGUF | i1-Q2_K | 8.2 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 8.6 | lower quality |
| GGUF | i1-IQ3_XS | 9.1 | |
| GGUF | i1-Q3_K_S | 9.6 | IQ3_XS probably better |
| GGUF | i1-IQ3_S | 9.6 | beats Q3_K* |
| GGUF | i1-IQ3_M | 9.7 | |
| GGUF | i1-Q3_K_M | 10.6 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 11.5 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 11.8 | |
| GGUF | i1-Q4_0 | 12.5 | fast, low quality |
| GGUF | i1-Q4_K_S | 12.5 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 13.3 | fast, recommended |
| GGUF | i1-Q4_1 | 13.8 | |
| GGUF | i1-Q5_K_S | 15.2 | |
| GGUF | i1-Q5_K_M | 15.6 | |
| GGUF | i1-Q6_K | 18.0 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

[image: quant quality comparison graph]

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, mixture of experts, moe, 64 experts, uncensored, unsloth, finetune, All use cases, bfloat16, creative, creative writing, fiction writing, plot generation, sub-plot generation, story generation, scene continue, storytelling, fiction story, science fiction, romance, all genres, story, writing, vivid prosing, vivid writing, fiction, en, dataset:TeichAI/claude-4.5-opus-high-reasoning-250x, base_model:DavidAU/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking, base_model:quantized:DavidAU/ERNIE-21B-A3B-Claude-4.5-High-OPUS-Thinking, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

changcheng967/flashlm-v5-thunderbolt


title: FlashLM v5 Thunderbolt
emoji: ⚡
colorFrom: yellow
colorTo: orange
sdk: gradio
sdk_version: 4.0.0
app_file: demo_v5.py
pinned: false
tags:
  • cpu
  • matmul-free
  • language-model
  • ternary-weights
  • efficient
  • pytorch
new_version: changcheng967/flashlm-v5.2-nova-ignition

FlashLM v5 "Thunderbolt" ⚡

A 29.7M parameter matmul-free language model trained entirely on CPU without GPUs.

Model Description

FlashLM v5 "Thunderbolt" is a revolutionary language model that was pre-trained from scratch on consumer hardware, without any GPUs. It uses a novel MatMul-free architecture called ParallelGatedRecurrence with ternary weights (BitLinear) to achieve dramatic efficiency improvements.

Key Achievements

  • Final PPL: 1.36 (beats the TinyStories-1M baseline, PPL 1.59!)
  • Final BPC: 0.44
  • Training Time: ~40 hours on AMD Ryzen 7950X3D
  • Training Data: ~1B tokens from TinyStories
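As a quick sanity check on the headline numbers (my arithmetic, not from the card): when perplexity and bits-per-character are computed over the same unit, BPC = log2(PPL), and the two reported figures agree.

```python
import math

# If PPL and BPC are measured per the same unit, BPC = log2(PPL).
ppl = 1.36
bpc = math.log2(ppl)
print(round(bpc, 2))  # 0.44, matching the reported BPC
```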

Architecture

FlashLM v5 uses ParallelGatedRecurrence, a matmul-free token mixer that replaces attention with:

  • Ternary weights (BitLinear): Quantized to {-1, 0, +1} reducing memory 16x
  • Parallel gated recurrence: Learned decay gates for efficient context
  • No matrix multiplications in the forward pass!
Parameters:     29,750,784
Ternary:       26,542,080 (89%)
Float:          3,208,704 (11%)
Ternary size:   ~6.6 MB (vs 119 MB float32)
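The card doesn't spell out the BitLinear quantization recipe; a common one (absmean rounding from the BitNet line of work, assumed here) scales each weight tensor by its mean absolute value before rounding to {-1, 0, +1}.

```python
import numpy as np

# Hedged sketch of absmean ternary quantization (BitNet b1.58-style);
# FlashLM's exact BitLinear recipe may differ.
def ternarize(W: np.ndarray):
    scale = np.mean(np.abs(W)) + 1e-8         # per-tensor absmean scale
    Wq = np.clip(np.round(W / scale), -1, 1)  # values in {-1, 0, +1}
    return Wq, scale                          # dequantize as Wq * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
Wq, s = ternarize(W)
print(sorted(set(Wq.flatten().tolist())))  # subset of [-1.0, 0.0, 1.0]
```

With only three weight values, the forward pass needs no multiplications: each ternary weight selects an add, a subtract, or a skip.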

Usage

With Gradio Demo

import torch
from demo_v5 import ThunderboltLM, load_model

# Load model and tokenizer (assuming load_model returns both;
# check demo_v5.py for the exact signature)
model, tokenizer = load_model("FlashLM_v5_Results")

# Generate text
prompt = "Once upon a time"
ids = tokenizer.encode(prompt).ids
x = torch.tensor([ids])
out = model.generate(x, max_new_tokens=100, temperature=0.8)
print(tokenizer.decode(out[0].tolist()))

Direct Model Loading

import torch
from ThunderboltLM import ThunderboltLM
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Create model
model = ThunderboltLM(
    vocab=8192,
    d_model=384,
    n_heads=8,
    d_head=48,
    n_layers=18,
    d_ffn=1152
)

# Load weights
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Generate
ids = tokenizer.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=100)
print(tokenizer.decode(out[0].tolist()))

Training Details

| Metric | Value |
|--------|-------|
| Parameters | 29.7M |
| Ternary Parameters | 26.5M |
| Vocabulary Size | 8192 |
| Model Dimension | 384 |
| Layers | 18 |
| Attention Heads | 8 |
| Head Dimension | 48 |
| FFN Dimension | 1152 |
| Context Length | 256 |
| Training Tokens | ~958M |
| Training Time | ~40 hours |
| Hardware | AMD Ryzen 7950X3D |
| Final Loss | 0.306 |
| Final PPL | 1.36 |
| Final BPC | 0.44 |

🎉 ACKNOWLEDGMENTS 🎉

Massive Thanks to arki05!!! 🙏🙏🙏

arki05 provided the AMD Ryzen 7950X3D used for training this model!

Without arki05's generous contribution of their machine, this project would not have been possible. I would still be stuck using free tier compute!

THANK YOU ARKI05!!! ⚡⚡⚡

Comparison with Baselines

| Model | Params | PPL | Training |
|-------|--------|-----|----------|
| FlashLM v5 Thunderbolt | 29.7M | 1.36 | ~40h CPU |
| TinyStories-1M (baseline) | 1M | 1.59 | ~24h GPU |
| FlashLM v4 "Bolt" | 4.3M | 15.05 | 2h CPU |
| FlashLM v5.2 "Nova-Ignition" | 5.0M | 10.56 | 2h CPU |

FlashLM v5 is the first CPU-trained model to beat the TinyStories-1M baseline while being trained on comparable compute!

Limitations

  • Trained only on TinyStories (synthetic short stories)
  • No chat capability
  • BPE tokenizer trained specifically for this dataset
  • CPU inference is slower than GPU

Citation

If you use this model, please cite:

@misc{flashlm-v5-thunderbolt,
  author = {Chang Cheng},
  title = {FlashLM v5 Thunderbolt: CPU-Based MatMul-Free Language Model},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

License

MIT License


FlashLM: Democratizing Language Model Research ⚡

Author: changcheng967

Likes: 2

Downloads: 44

Tags: cpu, matmul-free, language-model, ternary-weights, efficient, pytorch, region:us

marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B-onnx


license: apache-2.0
tags:
  • speaker-embedding
  • onnx
  • ecapa-tdnn
  • voice
  • x-vector
base_model: marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B
library_name: onnxruntime
pipeline_tag: feature-extraction

Qwen3-Voice-Embedding-12Hz-1.7B (ONNX)

ONNX exports of the Qwen3-Voice-Embedding-12Hz-1.7B ECAPA-TDNN speaker encoder. Produces 2048-dimensional x-vector speaker embeddings from audio.

Three quantization variants are provided for different deployment targets:

| File | Format | Size | Use case |
|------|--------|------|----------|
| speaker_encoder_fp32.onnx | Float32 | 46 MB | Maximum accuracy |
| speaker_encoder_fp16.onnx | Float16 | 23 MB | Browser / GPU inference (recommended) |
| speaker_encoder_int8.onnx | Int8 | 12 MB | Edge / mobile / minimal footprint |


Usage

Python (ONNX Runtime)

import numpy as np
import onnxruntime as ort
import librosa

# Load model
session = ort.InferenceSession("speaker_encoder_fp32.onnx")

# Compute mel spectrogram (must match training preprocessing)
audio, sr = librosa.load("audio.wav", sr=24000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=24000, n_fft=1024, hop_length=256,
    n_mels=128, fmin=0, fmax=12000,
)
mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
mel = mel.T[np.newaxis, ...]  # (1, time, 128)

# Run inference
embedding = session.run(None, {"mel_spectrogram": mel.astype(np.float32)})[0]
# embedding shape: (1, 2048)
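The card doesn't show how to compare two embeddings; speaker verification with x-vectors typically uses cosine similarity. A minimal sketch (any decision threshold is illustrative, not from this model card):

```python
import numpy as np

# Cosine similarity between two speaker embeddings; values near 1.0
# suggest the same speaker. Pick a verification threshold empirically.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

e1 = np.ones(2048)
e2 = np.ones(2048)
print(cosine_similarity(e1, e2))  # 1.0 for identical embeddings
```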

Browser (ONNX Runtime Web)

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("speaker_encoder_fp16.onnx");

// mel: Float32Array of shape [1, time_steps, 128]
const tensor = new ort.Tensor("float32", mel, [1, timeSteps, 128]);
const results = await session.run({ mel_spectrogram: tensor });
const embedding = results.speaker_embedding.data; // Float32Array(2048)

Input / Output

| | Name | Shape | Type |
|---|---|---|---|
| Input | mel_spectrogram | (batch, time, 128) | float32 |
| Output | speaker_embedding | (batch, 2048) | float32 |

The time axis is dynamic; any length mel spectrogram is accepted.

Audio Preprocessing

| Parameter | Value |
|-----------|-------|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Mel bins | 128 |
| Frequency range | 0 - 12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |


Export Details

Exported with torch.onnx.export (opset 18) from the standalone PyTorch model. Verified against PyTorch output (cosine similarity = 1.0, max abs diff < 3e-6).


Related Models


Citation

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}

Author: marksverdhei

Likes: 1

Downloads: 0

Tags: onnxruntime, onnx, speaker-embedding, ecapa-tdnn, voice, x-vector, feature-extraction, base_model:marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B, base_model:quantized:marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B, license:apache-2.0, region:us

marksverdhei/Qwen3-Voice-Embedding-12Hz-0.6B-onnx


license: apache-2.0
tags:
  • speaker-embedding
  • onnx
  • ecapa-tdnn
  • voice
  • x-vector
base_model: marksverdhei/Qwen3-Voice-Embedding-12Hz-0.6B
library_name: onnxruntime
pipeline_tag: feature-extraction

Qwen3-Voice-Embedding-12Hz-0.6B (ONNX)

ONNX exports of the Qwen3-Voice-Embedding-12Hz-0.6B ECAPA-TDNN speaker encoder. Produces 1024-dimensional x-vector speaker embeddings from audio.

Three quantization variants are provided for different deployment targets:

| File | Format | Size | Use case |
|------|--------|------|----------|
| speaker_encoder_fp32.onnx | Float32 | 35 MB | Maximum accuracy |
| speaker_encoder_fp16.onnx | Float16 | 18 MB | Browser / GPU inference (recommended) |
| speaker_encoder_int8.onnx | Int8 | 9 MB | Edge / mobile / minimal footprint |


Usage

Python (ONNX Runtime)

import numpy as np
import onnxruntime as ort
import librosa

# Load model
session = ort.InferenceSession("speaker_encoder_fp32.onnx")

# Compute mel spectrogram (must match training preprocessing)
audio, sr = librosa.load("audio.wav", sr=24000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=24000, n_fft=1024, hop_length=256,
    n_mels=128, fmin=0, fmax=12000,
)
mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
mel = mel.T[np.newaxis, ...]  # (1, time, 128)

# Run inference
embedding = session.run(None, {"mel_spectrogram": mel.astype(np.float32)})[0]
# embedding shape: (1, 1024)

Browser (ONNX Runtime Web)

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("speaker_encoder_fp16.onnx");

// mel: Float32Array of shape [1, time_steps, 128]
const tensor = new ort.Tensor("float32", mel, [1, timeSteps, 128]);
const results = await session.run({ mel_spectrogram: tensor });
const embedding = results.speaker_embedding.data; // Float32Array(1024)

Input / Output

| | Name | Shape | Type |
|---|---|---|---|
| Input | mel_spectrogram | (batch, time, 128) | float32 |
| Output | speaker_embedding | (batch, 1024) | float32 |

The time axis is dynamic; any length mel spectrogram is accepted.

Audio Preprocessing

| Parameter | Value |
|-----------|-------|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Mel bins | 128 |
| Frequency range | 0 - 12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |


Export Details

Exported with torch.onnx.export (opset 18) from the standalone PyTorch model. Verified against PyTorch output (cosine similarity > 0.9999).

The export script is available at: scripts/export_onnx.py


Related Models


Citation

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}

Author: marksverdhei

Likes: 1

Downloads: 0

Tags: onnxruntime, onnx, speaker-embedding, ecapa-tdnn, voice, x-vector, feature-extraction, base_model:marksverdhei/Qwen3-Voice-Embedding-12Hz-0.6B, base_model:quantized:marksverdhei/Qwen3-Voice-Embedding-12Hz-0.6B, license:apache-2.0, region:us

the-fall-of-man/didact-v1rc3-mxfp4


language: en
tags:
  • mlx
  • sillytavern
  • roleplaying
pipeline_tag: text-generation
library_name: mlx
base_model:
  • MuXodious/gpt-oss-20b-RichardErkhov-heresy

Didact, release candidate 3

Work in progress. Teaching the model the basics of roleplaying so I can DPO it.

Probably bad, though tbh I don't hate it so far.

There's an mxfp4. More details when I can.

Use with mlx

pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("the-fall-of-man/didact-v1rc3-mxfp4")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: the-fall-of-man

Likes: 1

Downloads: 0

Tags: mlx, safetensors, gpt_oss, sillytavern, roleplaying, text-generation, conversational, en, base_model:MuXodious/gpt-oss-20b-RichardErkhov-heresy, base_model:quantized:MuXodious/gpt-oss-20b-RichardErkhov-heresy, 4-bit, region:us