Today's AI Summary

Qwen3 Models Evolve: Enhanced Reasoning and Performance with New Releases

Today's AI landscape sees significant advancements in Large Language Models (LLMs), particularly with the Qwen3 family. New models and research papers highlight improvements in reasoning capabilities, evaluation methods, and real-world applications.

Research Papers

Several research papers address critical aspects of LLMs:

  • CompassVerifier (arXiv:2508.03686) introduces a unified and robust verifier for LLM evaluation and outcome reward. The paper highlights the limitations of current evaluation frameworks and proposes CompassVerifier, a lightweight model demonstrating multi-domain competency across math, knowledge, and reasoning tasks.
  • Self-Questioning Language Models (SQLM) (arXiv:2508.03682) explores how LLMs can improve without external data by generating their own questions and answers. This asymmetric self-play framework trains models via reinforcement learning, showing improvements on benchmarks like algebra problems and coding challenges.
  • Agent Lightning (arXiv:2508.03680) presents a flexible framework for RL-based training of LLMs for any AI agent. It decouples agent execution from training, allowing seamless integration with existing agents and demonstrating stable improvements across various tasks.

Model Updates

The Qwen3 family sees notable updates with the release of several new models:

  • Qwen3-4B-Instruct-2507: This model features significant improvements in instruction following, logical reasoning, text comprehension, and tool usage. It also boasts enhanced capabilities in long-context understanding (256K).
  • Qwen3-4B-Thinking-2507: This version focuses on scaling the thinking capability of Qwen3-4B, improving the quality and depth of reasoning. It shows significantly improved performance on reasoning tasks and general capabilities.
  • Qwen3-4B-Instruct-2507-FP8 & Qwen3-4B-Thinking-2507-FP8: These are FP8-quantized versions of the Instruct and Thinking models, respectively, offering performance benefits with fine-grained quantization.
  • flymy-ai/qwen-image-realism-lora: A LoRA for Qwen-Image that enhances realism in generated images.
  • GGUF Quantizations: Several GGUF versions of Qwen3 models are available, optimized for use with tools like LM Studio and llama.cpp.

Key Takeaways

  • Enhanced Reasoning: The Qwen3-4B-Thinking-2507 model demonstrates significant improvements in reasoning tasks, making it suitable for complex problem-solving.
  • Improved Evaluation: CompassVerifier offers a more robust and generalizable approach to evaluating LLMs, addressing limitations in current methodologies.
  • Agentic Capabilities: Qwen3 models excel in tool calling and agentic use, with recommended tools like Qwen-Agent for optimal performance.
  • Quantization: FP8 and GGUF quantizations provide options for efficient deployment and local use, balancing performance and resource requirements.
  • Community Support: The availability of community models and tools like LM Studio and Ollama facilitates experimentation and deployment of Qwen3 models.
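The memory impact of these quantization options is easy to estimate with back-of-the-envelope arithmetic. A minimal sketch for a 4B-parameter model (illustrative only: real checkpoints also carry embeddings, metadata, and quantization scales, and the 4.5 bits/weight figure for a typical 4-bit GGUF quant is an assumption, not a measured value):

```python
# Rough weight-memory footprint of a 4B-parameter model at different precisions.
# This counts weights only; activations and KV cache are extra.

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed for the weights alone, expressed in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

n = 4e9  # approximate Qwen3-4B parameter count
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4 GGUF (~Q4)", 4.5)]:
    print(f"{name:>16}: ~{weight_memory_gib(n, bits):.1f} GiB")
```

At these assumptions, FP8 halves the BF16 footprint and a 4-bit quant roughly quarters it, which is why the GGUF builds fit comfortably on consumer hardware.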

AI Papers for 2026-04-20

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

Generalization in LLM Problem Solving: The Case of the Shortest Path

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
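The shortest-path environment described above is simple to reproduce. A hypothetical instance generator, under the assumption of a 4-connected grid with blocked cells (the paper's exact map format is not specified here):

```python
from collections import deque

def bfs_shortest_path(walls, start, goal, size):
    """Breadth-first search on a size x size grid.

    walls is a set of blocked (x, y) cells; returns the shortest path
    as a list of cells, or None if the goal is unreachable.
    """
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Reconstruct the path by walking predecessor links backwards.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in walls and nxt not in prev):
                prev[nxt] = cell
                queue.append(nxt)
    return None

path = bfs_shortest_path(walls={(1, 0), (1, 1)}, start=(0, 0), goal=(2, 0), size=3)
print(path)  # detours around the wall: 7 cells from (0,0) to (2,0)
```

Sampling maps of varying size and wall density from such a generator is one way to realize the paper's two axes: unseen maps probe spatial transfer, while larger grids with longer optimal paths probe length scaling.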

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
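The split conformal construction used for diagnostic (2) is standard and easy to reproduce on synthetic data. A sketch with a toy stand-in for an LLM judge (the score distributions here are invented for illustration and are not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.1  # 1-5 Likert scale, 90% target coverage

def judge_probs(true_score):
    """Toy judge: softmax over negative distance to the true score, plus noise."""
    logits = -np.abs(np.arange(1, K + 1) - true_score) + rng.normal(0, 0.5, K)
    p = np.exp(logits)
    return p / p.sum()

# Calibration: nonconformity score = 1 - probability assigned to the true label.
cal_true = rng.integers(1, K + 1, 500)
cal_scores = np.array([1 - judge_probs(y)[y - 1] for y in cal_true])
n = len(cal_scores)
qhat = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(probs):
    """Keep every Likert score whose nonconformity falls below the threshold."""
    return [k for k in range(1, K + 1) if 1 - probs[k - 1] <= qhat]

# Empirical coverage on fresh data should be at least 1 - alpha (up to noise).
test_true = rng.integers(1, K + 1, 500)
cover = np.mean([y in prediction_set(judge_probs(y)) for y in test_true])
print(f"empirical coverage: {cover:.2f} (target >= {1 - alpha})")
```

Wider sets flag instances where the judge's scores are unreliable, which is exactly the per-instance signal the paper correlates across judges.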

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with the corresponding observation, resulting in hallucination in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret .

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
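The sampling mismatch the paper describes is simple arithmetic. A sketch with assumed-typical values (30 fps video, a 10 s clip, 8 uniformly sampled keyframes; only the 0.25 s micro-expression duration comes from the text above):

```python
# Why uniform sparse frame sampling misses micro-expressions: the gap between
# sampled keyframes is several times longer than the expression itself.
fps, clip_seconds, sampled_frames = 30, 10, 8   # assumed typical values
micro_expr_seconds = 0.25                        # lower bound cited in the abstract

gap = clip_seconds / sampled_frames              # seconds between kept keyframes
frames_spanned = micro_expr_seconds * fps        # source frames the expression covers
p_overlap = min(1.0, micro_expr_seconds / gap)   # chance a keyframe lands inside it

print(f"keyframe gap: {gap:.2f}s vs micro-expression: {micro_expr_seconds}s")
print(f"probability a random micro-expression is sampled at all: {p_overlap:.0%}")
```

Under these assumptions a 0.25 s expression has only a 20% chance of intersecting any sampled keyframe, which motivates the paper's strategy of summarizing the "in-between" frames as text instead of adding more visual tokens.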

Prism: Symbolic Superoptimization of Tensor Programs

This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over best superoptimizers and $4.9\times$ over best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.
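Mechanism (1), repetition, is straightforward to simulate classically. A sketch with standard prisoner's-dilemma payoffs (T=5, R=3, P=1, S=0; the scripted strategies here stand in for LLM agents and are not the paper's evaluation harness):

```python
# Iterated prisoner's dilemma: repetition lets conditional cooperators
# sustain mutual cooperation, but offers no protection against pure defectors.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds=100):
    """Run a repeated game; each strategy sees only the opponent's history."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_b), strategy_b(hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

tit_for_tat = lambda opp: "C" if not opp else opp[-1]  # cooperate, then mirror
always_defect = lambda opp: "D"

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation sustained
print(play(tit_for_tat, always_defect))  # cooperation collapses after round 1
```

Two tit-for-tat players earn the full cooperative payoff (300 each over 100 rounds), while against an unconditional defector cooperation collapses immediately, which mirrors the paper's finding that repetition-induced cooperation is fragile when co-players vary.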

Stability and Generalization in Looped Transformers

Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability -- reachability, input-dependence, and geometry -- and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework's predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with -- and on sudoku, substantially better than -- standard recall placement once outer normalization is applied.
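The recall distinction can be illustrated with a toy contraction map: if the input is not re-injected at every iteration, a contractive loop forgets its initial state, so its fixed point cannot depend on the input. A minimal numpy sketch (dimensions, scales, and the tanh map are arbitrary choices for illustration, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = 0.1 * rng.standard_normal((d, d))  # small spectral norm -> contraction in h
U = rng.standard_normal((d, d))        # input-injection ("recall") weights
b = rng.standard_normal(d)

def iterate(x, recall, steps=200):
    """Loop h <- tanh(W h (+ U x) + b); the input only seeds h when recall=False."""
    h = x.copy()
    for _ in range(steps):
        h = np.tanh(W @ h + (U @ x if recall else 0) + b)
    return h

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
no_recall_gap = np.linalg.norm(iterate(x1, False) - iterate(x2, False))
recall_gap = np.linalg.norm(iterate(x1, True) - iterate(x2, True))
print(f"without recall: |h*(x1) - h*(x2)| = {no_recall_gap:.2e}")  # ~0: input forgotten
print(f"with recall:    |h*(x1) - h*(x2)| = {recall_gap:.2e}")     # input-dependent
```

Without recall the two trajectories collapse to the same fixed point regardless of input, matching the theorem that recall-free loops cannot achieve strong input-dependence; with recall the fixed point tracks the input while the iteration still converges stably.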

AI Models

Intel/Qwen3.6-35B-A3B-int4-AutoRound


base_model: Qwen/Qwen3.6-35B-A3B

Model Details

This model is an INT4 quantization (group_size 128) of Qwen/Qwen3.6-35B-A3B, generated by intel/auto-round. Please follow the license of the original model.

vLLM Inference Example

pip install git+https://github.com/vllm-project/vllm.git@main


pip install git+https://github.com/huggingface/transformers.git
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound  --port 8000   --tensor-parallel-size 1  --max-model-len 2048 --reasoning-parser qwen3 --served-model-name qwen
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d ' {
    "model": "qwen",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize Qwen 3.6 in one sentence."}
    ],
    "temperature": 1,
    "max_tokens": 512
  } '
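The same chat request can be issued from Python with only the standard library. A hedged sketch assuming the vLLM server from the previous step is listening on localhost:8000 (the send is left commented out so the snippet is safe to run without a server):

```python
# Python equivalent of the curl example above, using only the standard library.
import json
import urllib.request

payload = {
    "model": "qwen",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Qwen 3.6 in one sentence."},
    ],
    "temperature": 1,
    "max_tokens": 512,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.full_url)
```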

Transformers Inference

Requires Transformers v5.5.1 and a build of auto-round with this PR applied: https://github.com/intel/auto-round/pull/1705

from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration
model_name = "Intel/Qwen3.6-35B-A3B-int4-AutoRound"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(model_name, dtype="auto",
                                                                device_map="auto")
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in short."},
        ],
    }
]


inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)


generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
"""
The user wants a short description of the image.

1.  **Identify the main subjects:** A woman and a large yellow dog (looks like a Labrador).
2.  **Identify the setting:** A sandy beach with the ocean in the background. The lighting suggests sunset or sunrise (golden hour).
3.  **Identify the action:** The dog is sitting and placing its paw in the woman's hand. The woman is smiling and sitting on the sand.
4.  **Synthesize into a short description:** A woman sitting on a beach at sunset gives a high-five (or "shake") to

"""

Generate the Model

This PR is required: https://github.com/intel/auto-round/pull/1705

auto-round  "Qwen/Qwen3.6-35B-A3B"  --output_dir "./Qwen36-int4" --ignore_layers shared_expert,mtp.fc

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.


Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}


Author: Intel

Likes: 8

Downloads: 0

Tags: safetensors, qwen3_5_moe, arxiv:2309.05516, base_model:Qwen/Qwen3.6-35B-A3B, base_model:quantized:Qwen/Qwen3.6-35B-A3B, 4-bit, auto-round, region:us

samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B


language: en
license: apache-2.0
tags: qwen3.6, moe, hermes, agentic, tool-calling, qlora, unsloth, carnice
base_model: Qwen/Qwen3.6-35B-A3B
datasets:

  • bespokelabs/Bespoke-Stratos-17k
  • AI-MO/NuminaMath-CoT
  • kai-os/carnice-glm5-hermes-traces
  • open-thoughts/OpenThoughts-Agent-v1-SFT

Carnice Qwen3.6 MoE 35B-A3B — Hermes-Focused Agentic Model

QLoRA fine-tune of Qwen3.6-35B-A3B (MoE, 3B active parameters) optimized for agentic workflows and Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.

This is the successor to Carnice-MoE-35B-A3B (based on Qwen3.5), retrained on the newer Qwen3.6 base which brings improved agentic coding, extended context (262K native, up to 1M with RoPE scaling), and native multimodal support.

Credits

Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.

Available Formats

| Format | Size | Location | Use Case |
|---|---|---|---|
| BF16 SafeTensors | 67 GB | Root | Full precision, Transformers / vLLM |
| FP8 Dynamic | 34 GB | fp8/ | vLLM optimized, ~2x faster inference |
| GGUF | 19-65 GB | GGUF repo | llama.cpp, Ollama, LM Studio |

FP8 Usage (vLLM)

# To use the pre-quantized weights, clone the repo and point vLLM at the fp8/ subfolder;
# alternatively, serve the repo id directly with FP8 quantization:
vllm serve samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B --quantization fp8 --dtype auto

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Native Context Length | 262,144 tokens |
| Thinking Modes | Thinking / Non-thinking (native Qwen3.6) |

What Makes This Different

Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:

  • Executes terminal commands and processes output
  • Performs file editing operations
  • Chains multi-step tool calls with results feeding back
  • Uses browser-assisted workflows
  • Makes decisions based on environmental feedback

This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.

Training Details

Two-Stage Approach

Stage A — Reasoning Repair (1 epoch)

  • Strengthens base model reasoning before agent-specific training
  • Loss: 0.4281

| Dataset | Examples |
|---|---|
| bespokelabs/Bespoke-Stratos-17k | 16,710 |
| AI-MO/NuminaMath-CoT | 17,000 (capped) |

Stage B — Hermes Traces (2 epochs)

  • Agent-specific behavioral training on real execution traces
  • Loss: 0.3045

| Dataset | Examples |
|---|---|
| kai-os/carnice-glm5-hermes-traces | 1,627 (high quality) |
| open-thoughts/OpenThoughts-Agent-v1-SFT | 15,209 |

Training Configuration

| Parameter | Stage A | Stage B |
|---|---|---|
| LoRA Rank | 64 | 64 |
| LoRA Alpha | 64 | 64 |
| LoRA Targets | q, k, v, o projections | q, k, v, o projections |
| Learning Rate | 2e-5 (linear) | 1e-5 (cosine) |
| Epochs | 1 | 2 |
| Effective Batch | 12 | 12 |
| Context Length | 4096 | 4096 |
| Precision | 4-bit QLoRA + BF16 adapters | Same |
| GPU | RTX PRO 6000 Blackwell (98GB) | Same |
| Total Training Time | ~55 hours (both stages) | |

Trainable Parameters

13,762,560 (0.04% of 35.1B total)

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B")

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

vLLM

vllm serve samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B --dtype auto --max-model-len 262144

llama.cpp

For llama.cpp usage, see the GGUF repo.

Acknowledgements

  • kai-os — Carnice training methodology and Hermes traces dataset
  • open-thoughts — Agent SFT dataset
  • bespokelabs — Bespoke-Stratos reasoning dataset
  • Unsloth — QLoRA training framework
  • Qwen — Base model

Author: samuelcardillo

Likes: 6

Downloads: 0

Tags: hermes, safetensors, qwen3_5_moe, qwen3.6, moe, agentic, tool-calling, qlora, unsloth, carnice, en, dataset:bespokelabs/Bespoke-Stratos-17k, dataset:AI-MO/NuminaMath-CoT, dataset:kai-os/carnice-glm5-hermes-traces, dataset:open-thoughts/OpenThoughts-Agent-v1-SFT, base_model:Qwen/Qwen3.6-35B-A3B, base_model:finetune:Qwen/Qwen3.6-35B-A3B, license:apache-2.0, region:us

CompactAI-O/TMLM-Haiku-2.3


license: mit

TMLM-Haiku-2.3

It speaks. It actually speaks. Mostly.

We have come so far. From the dark ages of couldcouldoldbloodblood to actual, coherent, structured sentences. This is TMLM-Haiku-2.3. It is 1 million parameters. It is small. It is trying its best. And unlike its ancestors, it usually succeeds.

Quick Stats

  • Parameters: 1,000,000 (Yes, really. 1M.)
  • Training Tokens: 10 Billion
  • Context Window: 2048 tokens
  • Vibe: Chaotic good, but mostly good.

What Is This?

Haiku-2.3 is the latest evolution of the TMLM-Haiku series. It builds on Haiku-2 by adding SPIN (Self-Play Fine-Tuning) to the training loop. This model represents a 3x improvement in combined performance score over the original Haiku. Coherence has jumped from 1.99 to 6.03. Relevance is no longer zero. It is a miracle.

The Journey

| Model | Era | Typical Output | Combined Score |
| :--- | :--- | :--- | :--- |
| Haiku-1 | The Dark Ages | couldcouldoldbloodbloodbodybody | 1.62 |
| Haiku-1.3 | The Pipe Character Incident | \|fdish\|\|\|\|\|!@\| | 1.21 |
| Haiku-2 | The Awakening | It is about **competent development**... | 3.87 |
| Haiku-2.3 (SPIN) | Current Era | The artificial intelligence is a problem... | 4.84 ★ |

Expected Output:

"The simple terms arrived in simulant explorers and honey are specific or forecasters. They allow the structure of their similar..."

Disclaimer

This is a 1 million parameter model.

  • It is not GPT-5.
  • It is not GPT-2.
  • It is a tiny neural network running on a prayer and a GPU.
  • It might still output chuamliamce occasionally. If it does, just try again. It is shy.
  • For best results, use temperature around 0.7. If you crank it to 2.0, you are on your own.

Benchmarks

We benchmarked Haiku-2.3 against all previous versions using a standard 7-question suite.

| Metric | Haiku-1 | Haiku-1.3 | Haiku-2 | Haiku-2.3 (SPIN) |
| :--- | :---: | :---: | :---: | :---: |
| Fluency | 0.50 | 1.69 | 8.35 | 8.78 |
| Coherence | 1.99 | 1.56 | 5.72 | 6.03 |
| Relevance | 1.22 | 0.00 | 0.00 | 2.25 |
| Format | 3.29 | 3.29 | 3.29 | 3.29 |
| Combined | 1.62 | 1.21 | 3.87 | 4.84 |

Related Models

Check out the rest of the family:

Acknowledgments

Built with curiosity over compute. Trained on FineWeb-Edu. SPIN optimized. And a lot of hope.


Built by CompactAI. If you like tiny models that try their best, give us a follow.

Author: CompactAI-O

Likes: 3

Downloads: 0

Tags: license:mit, region:us

doggy8088/ggml-breeze-asr-26


license: apache-2.0
library_name: whisper.cpp
tags:

  • whisper
  • whisper.cpp
  • ggml
  • coreml
  • automatic-speech-recognition
  • speech-recognition

ggml-breeze-asr-26

This repository contains whisper.cpp-compatible conversion artifacts for MediaTek-Research/Breeze-ASR-26.

Included files:

  • ggml-breeze-asr-26.bin: ggml model for whisper.cpp
  • ggml-breeze-asr-26-encoder.mlmodelc/: Core ML encoder compiled for Apple platforms

Upstream model

  • Base architecture: openai/whisper-large-v2
  • Upstream fine-tuned model: MediaTek-Research/Breeze-ASR-26
  • Original license: Apache-2.0

Usage with whisper.cpp

Build whisper.cpp with Core ML enabled:

cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

Run transcription with the converted model:

./build/bin/whisper-cli -m models/ggml-breeze-asr-26.bin -f samples/jfk.wav

When the .mlmodelc directory is placed next to the .bin file, whisper.cpp will load it automatically on Apple platforms.

Conversion

These artifacts were generated with ggml-org/whisper.cpp using:

./models/convert-hf-whisper-model.sh MediaTek-Research/Breeze-ASR-26

Author: doggy8088

Likes: 2

Downloads: 0

Tags: whisper.cpp, whisper, ggml, coreml, automatic-speech-recognition, speech-recognition, license:apache-2.0, region:us

dk2325/whisper-tiny-indian-accent


library_name: transformers
tags: []

Model Card for Model ID


Model Details

Model Description


This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

  • Developed by: [More Information Needed]
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: [More Information Needed]
  • Language(s) (NLP): [More Information Needed]
  • License: [More Information Needed]
  • Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses


Direct Use


[More Information Needed]

Downstream Use [optional]


[More Information Needed]

Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Testing Data, Factors & Metrics

Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]

Author: dk2325

Likes: 2

Downloads: 0

Tags: transformers, safetensors, whisper, automatic-speech-recognition, arxiv:1910.09700, endpoints_compatible, region:us

siraxe/VBVR-LTX2.3-diffsynth_comfyui


language:
  • en
tags:
  • lora
  • diffusers
  • template:diffusion-lora
  • ltx-2.3
  • image2video
  • img2vid
  • IC-LoRA
base_model:
  • Lightricks/LTX-2.3
instance_prompt: null
license: apache-2.0
pipeline_tag: image-to-video
pretty_name: VBVR-LTX2.3-diffsynth

Weights taken from https://huggingface.co/Video-Reason/VBVR-LTX2.3-diffsynth, with the LoRA keys converted for ComfyUI.

Author: siraxe

Likes: 2

Downloads: 0

Tags: diffusers, lora, template:diffusion-lora, ltx-2.3, image2video, img2vid, IC-LoRA, image-to-video, en, base_model:Lightricks/LTX-2.3, base_model:adapter:Lightricks/LTX-2.3, license:apache-2.0, region:us

saricles/MiniMax-M2.7-NVFP4-GB10-AC


base_model:
  • MiniMaxAI/MiniMax-M2.7
library_name: transformers
license: other
pipeline_tag: text-generation
language:
  • en
  • zh
tags:
  • minimax
  • nvfp4
  • 4-bit
  • quantized
  • compressed-tensors
  • vllm
  • DGX-Spark
  • GB10
  • MoE
  • agentic
  • tool-use
  • code

MiniMax-M2.7-NVFP4-GB10-AC

Agentic + Coder recalibration of MiniMax-M2.7 NVFP4-GB10. Same architecture and quantization scheme as saricles/MiniMax-M2.7-NVFP4-GB10, but calibrated on a 7-dataset mix targeted at agentic tool-use and code-generation workloads instead of general chat. The two are parallel variants of the same quant approach — sibling releases, not a version chain.

Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=8) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 141.05 GB on disk across 29 shards.

Why -AC? Why re-calibrate?

Post-training NVFP4 quantization depends on a calibration dataset to set per-layer activation scales (amax values). A 4-bit float format has 16 representable values — calibration determines how the full BF16 activation range at each layer is mapped to those 16 bins.

If calibration data doesn't match the target workload, real-world activations outside the calibrated range get clipped → quality loss on those inputs.
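The effect of the amax choice can be sketched in a few lines — a toy quantizer built on the standard FP4 (E2M1) value grid, not NVIDIA's NVFP4 kernels:

```python
# Toy FP4 (E2M1) quantizer illustrating amax-based scaling and clipping.
# A sketch of the idea only -- not NVIDIA's NVFP4 implementation.

# The E2M1 code points: +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}
FP4_GRID = sorted({s * m for s in (-1.0, 1.0)
                   for m in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)})

def quantize_fp4(x: float, amax: float) -> float:
    """Quantize one activation using a calibration-derived amax scale."""
    scale = amax / 6.0                          # amax maps to the largest FP4 magnitude
    scaled = max(-6.0, min(6.0, x / scale))     # anything beyond amax clips here
    q = min(FP4_GRID, key=lambda g: abs(g - scaled))  # round to nearest code point
    return q * scale

amax = 8.0                          # pretend calibration saw activations in [-8, 8]
inside = quantize_fp4(3.9, amax)    # lands near 4.0: small rounding error
clipped = quantize_fp4(20.0, amax)  # outside the calibrated range: clips to ~amax
```

Calibration chooses amax from observed activations at each layer; the -AC recalibration changes exactly these scales and nothing else, which is why the two variants are mechanically identical.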

  • NVFP4-GB10 calibrated on HuggingFaceH4/ultrachat_200k (general multi-turn English chat, 64 samples)
  • NVFP4-GB10-AC calibrated on a 7-dataset agentic + coder mix (896 samples queued, 888 after length filtering)

The -AC calibration mix is designed to align activation scales with the workloads the model will actually serve when deployed in agent frameworks like OpenClaw, Aider, or Claude Code-style assistants.

Calibration mix

128 samples per dataset, 49,152 (48K) max sequence length:

| Dataset | Samples | Domain |
|---|---:|---|
| theblackcat102/evol-codealpaca-v1 | 128 | Code generation |
| Salesforce/xlam-function-calling-60k | 128 | Tool calling / function invocation |
| open-r1/Mixture-of-Thoughts (code) | 128 | Code reasoning |
| open-r1/Mixture-of-Thoughts (math) | 128 | Mathematical reasoning |
| open-r1/Mixture-of-Thoughts (science) | 128 | Scientific reasoning |
| SWE-bench/SWE-smith-trajectories (tool split) | 128 | Software-engineering agent trajectories |
| HuggingFaceH4/ultrachat_200k (train_sft) | 128 | General multi-turn chat coverage |
| Total queued | 896 | — |
| Tokenized (post length-filter) | 888 | 8 dropped as too short after tokenization |

The 7th dataset (ultrachat_200k) is intentional: without a general-chat anchor, calibration would bias exclusively toward code/tool/math distributions and degrade plain conversational quality. The mix preserves chat capability while shifting activation scales toward the agentic/coder workloads this quant is built for.

Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, top-K=8) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Hidden Size | 3,072 |
| Vocab Size | 200,064 |
| Max Position Embeddings | 196,608 (192K context) |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 141.05 GB across 29 shards |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Non-commercial, inherited from MiniMaxAI/MiniMax-M2.7. See Use & License. |

Quantization Details

  • Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.29.0)
  • Transformers: 4.57.6 (with Conv1D compatibility shim for post-4.57 module relocation)
  • Scheme: mtq.NVFP4_DEFAULT_CFG (algorithm=max, group_size=16) + GB10-tuned disable list applied post-calibration
  • Calibration: 7-dataset agentic + coder mix (see table above), 896 samples queued / 888 tokenized @ 49,152 max-seq
  • Ignore list (kept in BF16, from published hf_quant_config.json):
    • lm_head, *embed_tokens*
    • *block_sparse_moe.gate — MoE router gate (not per-expert gates)
    • *model.layers.0.* — first transformer block
    • *model.layers.61.* — last transformer block
  • Quantizer counts: 143,967 TensorQuantizer modules inserted, 51,327 disabled via ignore list, 92,640 active during calibration
  • GB10 specialization: self_attn stays QUANTIZED (vs. the standard NVFP4 reference configuration which keeps attention BF16) — the GB10 ignore list only covers the items listed above
  • Calibration run: Hugging Face Jobs, 8× NVIDIA A100 80 GB, ~10 hours wall-clock, single-phase (no wallclock-cap, no deferred samples, no OOMs)
  • Starvation check: 0 starved experts at end of calibration (every active quantizer received enough token traffic to produce a valid amax)
  • Recipe script: quantize-ac-protected.py — full three-phase recipe with OOM-defer protection, amax-only checkpointing, and inline export

Running on 2× DGX Spark (Tensor Parallel)

At 141.05 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It runs with tensor-parallel-size=2 across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is eugr/spark-vllm-docker.

Quick start: run_vllm.sh is a ready-to-run wrapper — exports the tuned environment variables and invokes vllm serve with the working flag set.

Full deployment reference: DEPLOYMENT.md documents the two deployment profiles we tested, measured numbers, and known hardware/framework quirks specific to GB10 (SM 12.1) and multi-node Ray TP.

The short version: on GB10 the fastest NVFP4 MoE path is the Marlin backend (VLLM_NVFP4_GEMM_BACKEND=marlin, VLLM_USE_FLASHINFER_MOE_FP4=0), and if your workload is agentic (tool-calling, code generation, repeated-token-heavy) you should additionally enable ngram speculative decoding. See DEPLOYMENT.md for the full rationale and benchmark data.
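Concretely, the agentic profile amounts to something like the following. The two environment variables are named in this card; the `vllm serve` flags and the speculative-decoding settings are assumptions that may differ across vLLM builds — run_vllm.sh and DEPLOYMENT.md are authoritative:

```shell
# Sketch of the GB10 agentic profile. Env vars are from this card; the
# serve flags and speculative settings below are assumptions -- consult
# run_vllm.sh / DEPLOYMENT.md for the tested invocation.
export VLLM_NVFP4_GEMM_BACKEND=marlin    # Marlin NVFP4 MoE path
export VLLM_USE_FLASHINFER_MOE_FP4=0    # keep the FlashInfer FP4 MoE path off

vllm serve saricles/MiniMax-M2.7-NVFP4-GB10-AC \
  --tensor-parallel-size 2 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}'
```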

Client-side tips

Every client that calls this endpoint should set max_tokens ≥ 16384. The OpenAI SDK's default of 4096 will silently truncate tool-call JSON mid-string, which appears as "model forgot how to use tools" but is actually just a clipped response. Bump it.
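A minimal sketch of that rule — building the request payload so `max_tokens` can never silently fall back to a client default (model name is this repo; the guard threshold mirrors the recommendation above):

```python
# Sketch: build chat-completion payloads with max_tokens pinned high
# enough for tool-call JSON. The endpoint wiring is left to the caller.

def build_chat_request(messages, model="saricles/MiniMax-M2.7-NVFP4-GB10-AC",
                       max_tokens=16384):
    if max_tokens < 16384:
        raise ValueError("max_tokens < 16384 risks truncating tool-call JSON")
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

payload = build_chat_request([{"role": "user", "content": "Refactor utils.py"}])
```

The resulting dict can be posted to `/v1/chat/completions` or passed as kwargs to any OpenAI-compatible SDK.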

When to choose -AC vs NVFP4-GB10

  • Use -AC for: agent frameworks (OpenClaw, Aider, Claude Code-style), tool-calling workloads, code-generation assistants, multi-turn reasoning over code/math.
  • Use NVFP4-GB10 for: general chat applications, scenarios where the calibration-dataset provenance matches the published NVFP4-GB10 benchmarks exactly.

Both variants are mechanically compatible (same vLLM invocation, same compressed-tensors format). Only the per-layer NVFP4 activation scales differ — size on disk, architecture, ignore list, and deployment are unchanged.

Performance

Benchmarked on 2× NVIDIA DGX Spark (GB10, SM 12.1), TP=2 via Ray over QSFP56 RoCE, post-firmware SoC 2.148.24, with --gpu-memory-utilization 0.88; measured 2026-04-19 on vLLM 0.19.1rc1.dev241 via the eugr/spark-vllm-docker nightly image. The two deployment profiles are documented in DEPLOYMENT.md. Numbers below were observed on this rig; your mileage depends on build, image, and workload.

Profile 1 — Throughput-stable (Marlin NVFP4 MoE, no speculation)

Benchmarked with llama-benchy v0.3.3, 3 runs per config, warm model, single client.

| Prompt (tok) | Gen (tok) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---:|---:|---:|---:|---:|
| 512 | 128 | 1,128 | 35.44 | 454 |
| 512 | 256 | 1,248 | 35.86 | 410 |
| 1024 | 128 | 2,049 | 35.03 | 500 |
| 1024 | 256 | 2,132 | 34.50 | 480 |
| 4096 | 128 | 2,817 | 33.76 | 1,454 |
| 4096 | 256 | 3,314 | 33.45 | 1,236 |

API latency: 1.50 ms. Peak decode: 35.86 tok/s.

Profile 2 — Agentic (Marlin NVFP4 MoE + ngram speculative decoding)

Measured on a 12-prompt agent-flavored set (code generation, tool calls, short chat) — not a standard benchmark; it approximates real agent-framework traffic. Same hardware, same sampling, only the serving config differs.

| Metric | Throughput-stable profile | Agentic profile |
|---|---:|---:|
| Average decode across 12 prompts (tok/s) | 25.20 | 36.44 |
| Peak decode (tok/s) | 35.86 | 48.34 (code-04: async-pattern) |
| Total wall-clock for full prompt set (s) | 250.8 | 162.7 |
| Wall-clock speedup (Agentic vs Throughput-stable) | — | 1.54× |

Per-task wall-clock highlights (DEPLOYMENT.md has the full breakdown): code-02 (MBPP-style) 2.13× faster; code-04 (async pattern) 1.90× faster; chat-03 (creative writing) 2.06× faster; tool-04 (don't-call-tool trap) 1.96× faster.

Why the two profiles differ: ngram speculative decoding wins big when responses contain repeated tokens (tool names, file paths, variable names, JSON keys reappearing) — which agent/code workloads have abundantly. On synthetic benchmarks with low token repetition (like llama-benchy's generated prompts), ngram's overhead slightly exceeds its savings and decode regresses. DEPLOYMENT.md documents this tradeoff and when to pick each profile.
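The intuition behind that tradeoff fits in a few lines: a prompt-lookup ("ngram") drafter proposes the tokens that followed the most recent earlier occurrence of the current n-gram, so it only speculates productively when the sequence repeats itself. This is an illustrative sketch, not vLLM's implementation:

```python
# Illustrative prompt-lookup ("ngram") drafter -- a sketch of the idea,
# not vLLM's implementation. It proposes the k tokens that followed the
# most recent earlier occurrence of the current n-gram.

def ngram_draft(tokens, n=3, k=4):
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):   # scan backwards
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]         # speculate what followed last time
    return []

# Repetitive sequences (tool names, paths, JSON keys) yield useful drafts:
assert ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]) == [4, 5, 1, 2]
# Non-repeating sequences yield nothing, so speculation only adds overhead:
assert ngram_draft([1, 2, 3, 4, 5, 6]) == []
```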

Qualitative — agentic behavior vs NVFP4-GB10 sibling

Same prompt set, same sampling, compared against the general-chat-calibrated sibling variant:

| Task | NVFP4-GB10 tokens | NVFP4-GB10-AC tokens | Wall-clock speedup (AC) |
|---|---:|---:|---:|
| "Answer directly; don't call the provided tool" (trap) | 718 | 44 | 14.7× |
| Multi-step meeting booking (3 tools) | 385 | 81 | 4.6× |
| Weather (single tool) | 73 | 51 | 2.5× |
| Parallel stock prices (parallel tool calls) | 176 | 121 | 1.4× |

AC is measurably more decisive on tool-use tasks — it emits cleaner, shorter tool calls and, crucially, doesn't over-invoke tools when direct answers suffice. Raw throughput (decode/prefill) is within noise of NVFP4-GB10 as expected — quant format is identical, only activation scales differ. The meaningful delta between AC and GB10 is qualitative on agentic tasks, not numeric on raw throughput.

Notes

  • See DEPLOYMENT.md for the full environment, flags, caveats, and why Marlin is the right MoE backend on SM 12.1 GB10 today.
  • For published standardized benchmarks (HumanEval, BFCL, MT-Bench, WildClawBench), see forthcoming evaluation runs.

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}

Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.

If you only have one DGX Spark

At 141.05 GB this model does not fit in a single Spark's 128 GB unified memory — it requires 2× Spark with tensor parallelism. If you have only one Spark, consider the REAP-pruned variant: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 (98.9 GB, single-node deployment).

Use & License

This derivative inherits the license terms of the base model, MiniMaxAI/MiniMax-M2.7. The full license text is distributed in the LICENSE file in this repo.

Permitted free uses (from §5 of the base license): personal use — including self-hosted deployment for coding, development of applications, agents, tools, integrations, research, experimentation, or other personal purposes; use by non-profit organizations, academic institutions, and researchers for non-commercial research or educational purposes; and modifications for the uses above.

Commercial use requires authorization directly from MiniMax. If you intend to use this model (or any derivative) for commercial purposes — including offering products/services to third parties for a fee, commercial-product APIs, or commercial deployment — you must:

  1. Obtain prior written authorization from MiniMax by emailing api@minimax.io with subject line "M2.7 licensing", and
  2. Prominently display "Built with MiniMax M2.7" on the related website, user interface, blogpost, about page, or product documentation.

Prohibited uses (from the license appendix) — by using this model you agree not to use it to: generate or disseminate content prohibited by applicable laws, support any military purpose, exploit or harm minors, generate harmful misinformation intended to deceive, or promote discrimination or hate speech.

This quantization pipeline and the recipe script in this repo (quantize-ac-protected.py) are released under the same terms as the base model, as a derivative work.

Reproducibility

Full recipe script: quantize-ac-protected.py

The script implements a three-phase protected calibration pipeline:

  • Phase A — Calibration with per-sample OOM defer, amax-only checkpoints every N samples (60 MB each, versus ~460 GB per checkpoint if saving full state), optional two-phase bucket commit with sha256 markers, wallclock watchdog (soft + hard exit). Inline export at end on successful completion.
  • Phase B (fallback) — Resume from the latest good checkpoint, process deferred samples on a larger-memory GPU flavor, rescue starved experts, export.
  • Phase C (recovery only) — Re-export from a saved checkpoint if Phases A/B completed calibration but crashed during export.

Env vars consumed by the recipe:

  • PHASE = A | B | C
  • INPUT_DIR — path to the BF16 source model
  • OUTPUT_DIR — export target (Phase A inline export + Phase B/C export)
  • TARGET_REPO_ID — HF Hub repo to publish the quantized model to
  • BUCKET_REPO_ID — HF Hub dataset repo used as a workspace for checkpoints (optional; remove Phase B/C if you don't want a bucket)
  • BUCKET_PREFIX — path prefix inside the bucket repo
  • NUM_CALIB_PER_DS (default 128)
  • MAX_SEQ (default 49152)
  • CKPT_EVERY (default 50)
  • WALLCLOCK_BUDGET_S (default 21600 = 6h; Phase A exits cleanly before cap)
  • STARVED_EXPERT_PCT_ABORT (default 1.0%)
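Put together, a Phase A launch might look like this (all paths and repo IDs below are placeholders; only the env var names and defaults come from the list above):

```shell
# Hypothetical Phase A invocation; paths and repo IDs are placeholders.
export PHASE=A
export INPUT_DIR=/data/minimax-m2.7-bf16
export OUTPUT_DIR=/data/minimax-m2.7-nvfp4-gb10-ac
export TARGET_REPO_ID=your-org/MiniMax-M2.7-NVFP4-GB10-AC
export NUM_CALIB_PER_DS=128      # defaults shown explicitly for clarity
export MAX_SEQ=49152
export CKPT_EVERY=50
export WALLCLOCK_BUDGET_S=21600  # 6h soft cap; Phase A exits cleanly before it

python quantize-ac-protected.py
```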

Run for this release:

  • Job: HF Jobs a100x8, single Phase A invocation
  • Duration: ~10 hours wall-clock (01:40 UTC start → 11:41 UTC Phase A DONE → inline export → publish)
  • Outcome: status=complete-published, deferred=0, starved=0
  • 14 amax-only checkpoints written during calibration (60 MB each), ckpt 14 is the final post-rescue state

Author: saricles

Likes: 2

Downloads: 0

Tags: transformers, safetensors, minimax_m2, text-generation, minimax, nvfp4, 4-bit, quantized, compressed-tensors, vllm, DGX-Spark, GB10, MoE, agentic, tool-use, code, conversational, custom_code, en, zh, base_model:MiniMaxAI/MiniMax-M2.7, base_model:finetune:MiniMaxAI/MiniMax-M2.7, license:other, endpoints_compatible, 8-bit, region:us

samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF


language:
  • en
license: apache-2.0
tags:
  • qwen3.6
  • moe
  • hermes
  • agentic
  • tool-calling
  • qlora
  • unsloth
  • carnice
  • gguf
  • llama-cpp
base_model: Qwen/Qwen3.6-35B-A3B
datasets:
  • bespokelabs/Bespoke-Stratos-17k
  • AI-MO/NuminaMath-CoT
  • kai-os/carnice-glm5-hermes-traces
  • open-thoughts/OpenThoughts-Agent-v1-SFT

Carnice Qwen3.6 MoE 35B-A3B — Hermes-Focused Agentic Model (GGUF)

QLoRA fine-tune of Qwen3.6-35B-A3B (MoE, 3B active parameters) optimized for agentic workflows and Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.

This is the successor to Carnice-MoE-35B-A3B (based on Qwen3.5), retrained on the newer Qwen3.6 base which brings improved agentic coding, extended context (262K native, up to 1M with RoPE scaling), and native multimodal support.

Credits

Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.

Available Quantizations

| Quantization | Size | Min VRAM |
|---|---|---|
| F16 | 65 GB | 1x 98GB GPU |
| Q8_0 | 35 GB | 1x 48GB GPU |
| Q6_K | 27 GB | 1x 32GB GPU |
| Q5_K_M | 24 GB | 1x 32GB GPU |
| Q4_K_M | 20 GB | 1x 24GB GPU |
| Q4_K_S | 19 GB | 1x 24GB GPU |

For BF16 safetensors, see samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B.

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Native Context Length | 262,144 tokens |
| Thinking Modes | Thinking / Non-thinking (native Qwen3.6) |

What Makes This Different

Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:

  • Executes terminal commands and processes output
  • Performs file editing operations
  • Chains multi-step tool calls with results feeding back
  • Uses browser-assisted workflows
  • Makes decisions based on environmental feedback

This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.
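Hermes-style agent runtimes typically wrap tool invocations in `<tool_call>` tags containing JSON, interleaved with tool results. A minimal sketch of pulling such calls out of a model turn — the tag convention is an assumption about the Hermes format, not something this card specifies:

```python
import json
import re

# Sketch: extract Hermes-style tool calls from a model turn. The
# <tool_call> JSON convention is an assumption about the Hermes format.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(turn: str):
    return [json.loads(m) for m in TOOL_CALL_RE.findall(turn)]

turn = (
    "Let me inspect the repository first.\n"
    '<tool_call>{"name": "terminal", "arguments": {"cmd": "ls src/"}}</tool_call>'
)
calls = extract_tool_calls(turn)   # one call: terminal with cmd "ls src/"
```

Training on real traces teaches the model to emit exactly this shape, rather than approximating it from generic instruction data.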

Training Details

Two-Stage Approach

Stage A — Reasoning Repair (1 epoch)

  • Strengthens base model reasoning before agent-specific training
  • Loss: 0.4281

| Dataset | Examples |
|---|---|
| bespokelabs/Bespoke-Stratos-17k | 16,710 |
| AI-MO/NuminaMath-CoT | 17,000 (capped) |

Stage B — Hermes Traces (2 epochs)

  • Agent-specific behavioral training on real execution traces
  • Loss: 0.3045

| Dataset | Examples |
|---|---|
| kai-os/carnice-glm5-hermes-traces | 1,627 (high quality) |
| open-thoughts/OpenThoughts-Agent-v1-SFT | 15,209 |

Training Configuration

| Parameter | Stage A | Stage B |
|---|---|---|
| LoRA Rank | 64 | 64 |
| LoRA Alpha | 64 | 64 |
| LoRA Targets | q, k, v, o projections | q, k, v, o projections |
| Learning Rate | 2e-5 (linear) | 1e-5 (cosine) |
| Epochs | 1 | 2 |
| Effective Batch | 12 | 12 |
| Context Length | 4096 | 4096 |
| Precision | 4-bit QLoRA + BF16 adapters | Same |
| GPU | RTX PRO 6000 Blackwell (98GB) | Same |
| Total Training Time | ~55 hours (both stages) | |

Trainable Parameters

13,762,560 (0.04% of 35.1B total)
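Counts like this follow from a simple formula: LoRA adds `A (r × d_in)` and `B (d_out × r)` beside each adapted weight, i.e. `r × (d_in + d_out)` trainable parameters per matrix. A generic sketch with hypothetical dimensions (the card does not list Qwen3.6's projection shapes):

```python
# LoRA adds A (r x d_in) and B (d_out x r) beside each adapted weight,
# i.e. r * (d_in + d_out) trainable parameters per matrix. The shapes
# below are hypothetical -- for illustration only.

def lora_params(shapes, rank):
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical layer: hidden 2048, q/o are 2048x2048, k/v are 512x2048 (GQA)
layer = [(2048, 2048), (512, 2048), (512, 2048), (2048, 2048)]
per_layer = lora_params(layer, rank=64)   # 851,968 for this made-up layer
```

Whatever the real shapes, the count scales linearly in rank, which is how rank-64 adapters over q/k/v/o stay at 0.04% of a 35B model.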

Usage with llama.cpp

# Download a quantization (e.g., Q8_0)
huggingface-cli download samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF \
  Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf --local-dir .

# Run with llama-server
llama-server \
  --model Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --host 0.0.0.0 --port 8000

Acknowledgements

  • kai-os — Carnice training methodology and Hermes traces dataset
  • open-thoughts — Agent SFT dataset
  • bespokelabs — Bespoke-Stratos reasoning dataset
  • Unsloth — QLoRA training framework
  • Qwen — Base model

Author: samuelcardillo

Likes: 2

Downloads: 0

Tags: hermes, gguf, qwen3.6, moe, agentic, tool-calling, qlora, unsloth, carnice, llama-cpp, en, dataset:bespokelabs/Bespoke-Stratos-17k, dataset:AI-MO/NuminaMath-CoT, dataset:kai-os/carnice-glm5-hermes-traces, dataset:open-thoughts/OpenThoughts-Agent-v1-SFT, base_model:Qwen/Qwen3.6-35B-A3B, base_model:quantized:Qwen/Qwen3.6-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational

mradermacher/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B-i1-GGUF


base_model: DavidAU/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B
language:
  • en
library_name: transformers
license: apache-2.0
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • finetune
  • unsloth
  • mixture of experts
  • sparse moe
  • moe

About


Weighted/imatrix quants of https://huggingface.co/DavidAU/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B


For a convenient overview and download list, visit our model page for this model.

Static quants are available at https://huggingface.co/mradermacher/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 2.4 | for the desperate |
| GGUF | i1-IQ1_M | 2.7 | mostly desperate |
| GGUF | i1-IQ2_XXS | 3.1 | |
| GGUF | i1-IQ2_XS | 3.4 | |
| GGUF | i1-IQ2_S | 3.5 | |
| GGUF | i1-IQ2_M | 3.8 | |
| GGUF | i1-Q2_K_S | 3.9 | very low quality |
| GGUF | i1-Q2_K | 4.2 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 4.5 | lower quality |
| GGUF | i1-IQ3_XS | 4.7 | |
| GGUF | i1-IQ3_S | 5.0 | beats Q3_K* |
| GGUF | i1-Q3_K_S | 5.0 | IQ3_XS probably better |
| GGUF | i1-IQ3_M | 5.1 | |
| GGUF | i1-Q3_K_M | 5.5 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 5.9 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 6.1 | |
| GGUF | i1-IQ4_NL | 6.5 | prefer IQ4_XS |
| GGUF | i1-Q4_0 | 6.5 | fast, low quality |
| GGUF | i1-Q4_K_S | 6.5 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 6.9 | fast, recommended |
| GGUF | i1-Q4_1 | 7.2 | |
| GGUF | i1-Q5_K_S | 7.9 | |
| GGUF | i1-Q5_K_M | 8.1 | |
| GGUF | i1-Q6_K | 9.4 | practically like static Q6_K |

A handy graph by ikawrakow comparing some lower-quality quant types (lower is better) is included in the original model card.

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, finetune, unsloth, mixture of experts, sparse moe, moe, en, base_model:DavidAU/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B, base_model:quantized:DavidAU/LFM2-12B-A1B-SpeedDemon-High-Intelligence-Series-B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

ValiantLabs/Qwen3.6-35B-A3B-Esper3.1


language:
  • en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  • esper
  • esper-3.1
  • esper-3
  • valiant
  • valiant-labs
  • qwen
  • qwen-3.6
  • qwen-3.6-35b-a3b
  • 35b
  • reasoning
  • code
  • code-instruct
  • python
  • javascript
  • dev-ops
  • jenkins
  • terraform
  • ansible
  • docker
  • kubernetes
  • helm
  • grafana
  • prometheus
  • shell
  • bash
  • azure
  • aws
  • gcp
  • cloud
  • scripting
  • powershell
  • problem-solving
  • architect
  • engineer
  • developer
  • creative
  • analytical
  • expert
  • rationality
  • conversational
  • chat
  • instruct
base_model: Qwen/Qwen3.6-35B-A3B
datasets:
  • sequelbox/Titanium3-DeepSeek-V3.1-Terminus
  • sequelbox/Tachibana3-Part1-DeepSeek-V3.1-Terminus
  • sequelbox/Tachibana3-Part2-DeepSeek-V3.2
  • sequelbox/Mitakihara-DeepSeek-R1-0528
license: apache-2.0

Support our open-source dataset and model releases!


Esper 3.1: Ministral-3-3B-Reasoning-2512, Qwen3-4B-Thinking-2507, Ministral-3-8B-Reasoning-2512, Ministral-3-14B-Reasoning-2512, gpt-oss-20b, Qwen3.5-27B, Qwen3.6-35B-A3B

Esper 3.1 is a coding, architecture, and DevOps reasoning specialist built on Qwen 3.6 35B A3B!

  • Your dedicated DevOps expert: Esper 3.1 maximizes DevOps and architecture helpfulness, powered by high-difficulty DevOps and architecture data generated with DeepSeek-V3.1-Terminus!
  • Improved coding performance: challenging code-reasoning datasets stretch DeepSeek-V3.1-Terminus and DeepSeek-V3.2 to the limits, allowing Esper 3.1 to tackle harder coding tasks!
  • AI to build AI: our high-difficulty AI expertise data boosts Esper 3.1's MLOps, AI architecture, AI research, and general reasoning skills.
  • Small model sizes allow running on local desktop and mobile, plus super-fast server inference!

Prompting Guide

Esper 3.1 uses the Qwen3.6-35B-A3B prompt format.

Example inference script to get started:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ValiantLabs/Qwen3.6-35B-A3B-Esper3.1"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Write a Terraform configuration that uses the `aws_ami` data source to find the latest Amazon Linux 2 AMI. Then, provision an EC2 instance using this dynamically determined AMI ID."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 248069 (</think>)
    index = len(output_ids) - output_ids[::-1].index(248069)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)


Esper 3.1 is created by Valiant Labs.

Check out our HuggingFace page to see all of our models!

We care about open source. For everyone to use.

Author: ValiantLabs

Likes: 1

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe_text, text-generation, esper, esper-3.1, esper-3, valiant, valiant-labs, qwen, qwen-3.6, qwen-3.6-35b-a3b, 35b, reasoning, code, code-instruct, python, javascript, dev-ops, jenkins, terraform, ansible, docker, kubernetes, helm, grafana, prometheus, shell, bash, azure, aws, gcp, cloud, scripting, powershell, problem-solving, architect, engineer, developer, creative, analytical, expert, rationality, conversational, chat, instruct, image-text-to-text, en, dataset:sequelbox/Titanium3-DeepSeek-V3.1-Terminus, dataset:sequelbox/Tachibana3-Part1-DeepSeek-V3.1-Terminus, dataset:sequelbox/Tachibana3-Part2-DeepSeek-V3.2, dataset:sequelbox/Mitakihara-DeepSeek-R1-0528, base_model:Qwen/Qwen3.6-35B-A3B, base_model:finetune:Qwen/Qwen3.6-35B-A3B, license:apache-2.0, endpoints_compatible, region:us