Today's AI Summary

AI Developments: Audio Codecs, Diffusion Models, and LLM Quantization Emerge

This week's AI landscape is marked by advancements in audio processing, diffusion models, and techniques for compressing large language models (LLMs). New research explores multi-view image editing, native vision-language primitives, and agentic design of compositional machines.

Research Highlights

  • Multi-View Image Editing: A paper titled "Coupled Diffusion Sampling for Training-Free Multi-View Image Editing" introduces a novel inference-time diffusion sampling method. This method enables multi-view consistent image editing using pre-trained 2D image editing models, addressing the challenge of maintaining consistency across different views of a 3D scene.
  • Native Vision-Language Models: The paper "From Pixels to Words -- Towards Native Vision-Language Primitives at Scale" presents NEO, a new family of native Vision-Language Models (VLMs). NEO is designed to align pixel and word representations effectively, integrate vision and language modules seamlessly, and embody cross-modal properties for unified vision-language processing.
  • Agentic Design of Compositional Machines: "Agentic Design of Compositional Machines" explores the use of LLMs in creating complex machines. The paper introduces BesiegeField, a testbed for machine design, and benchmarks LLMs with agentic workflows, identifying key capabilities for success in this domain.
  • KV Cache Optimization: The paper "Attention Is All You Need for KV Cache in Diffusion LLMs" introduces Elastic-Cache, a training-free, architecture-agnostic strategy that adaptively recomputes key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency.

Model Releases

  • LongCat-Audio-Codec: Meituan-Longcat has released LongCat-Audio-Codec, an audio tokenizer and detokenizer solution tailored for speech LLMs. This codec generates semantic and acoustic tokens in parallel, enabling high-fidelity audio reconstruction at ultra-low bitrates.
  • LLaDA2.0-mini-preview: inclusionAI has introduced LLaDA2.0-mini-preview, a diffusion language model featuring a Mixture-of-Experts (MoE) architecture. This instruction-tuned model, pre-trained on approximately 20 trillion tokens, activates only 1.4 billion parameters at inference time. It demonstrates strong performance in code generation, mathematical reasoning, and tool use.
  • Qwen3 Quantized Models: Huawei-CSL has released a series of 4-bit and 3-bit quantized versions of the Qwen3 models (1.7B, 14B, and 32B). These models utilize the SINQ (Sinkhorn-Normalized Quantization) method, aiming to reduce model size while preserving accuracy.
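The core idea behind Sinkhorn-style normalization before quantization can be sketched in a few lines: alternately rescale rows and columns toward unit standard deviation, quantize the normalized matrix, and keep the two scale vectors for dequantization. The sketch below is illustrative only; the function name, iteration count, and rounding scheme are assumptions, not the released SINQ code.

```python
import numpy as np

def sinkhorn_normalize_quantize(W, bits=4, iters=10):
    """Illustrative sketch: alternately rescale rows and columns of W
    toward unit standard deviation (a Sinkhorn-style balancing pass),
    then apply symmetric round-to-nearest quantization. Hypothetical
    implementation, not the official SINQ code."""
    W = W.astype(np.float64)            # astype copies; W outside is untouched
    row_scale = np.ones(W.shape[0])
    col_scale = np.ones(W.shape[1])
    for _ in range(iters):
        r = W.std(axis=1) + 1e-8        # per-row std
        W /= r[:, None]
        row_scale *= r
        c = W.std(axis=0) + 1e-8        # per-column std
        W /= c[None, :]
        col_scale *= c
    # symmetric uniform quantization of the normalized matrix
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W).max() / qmax
    Q = np.clip(np.round(W / step), -qmax, qmax)
    # dequantize: reapply the step size and the two scale vectors
    W_hat = (Q * step) * row_scale[:, None] * col_scale[None, :]
    return Q.astype(np.int8), W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Q, W_hat = sinkhorn_normalize_quantize(W)
print(float(np.abs(W - W_hat).mean()))  # small mean reconstruction error
```

Balancing row and column variances before rounding is what lets a single per-matrix step size serve all entries reasonably well; the released models presumably refine this considerably.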

Key Takeaways

  • Audio Processing Advances: LongCat-Audio-Codec offers a promising solution for efficient and high-fidelity audio processing in speech LLMs.
  • Efficient Diffusion Models: LLaDA2.0-mini-preview showcases the potential of MoE architectures in diffusion models, achieving high performance with reduced computational costs.
  • LLM Compression Techniques: The release of Qwen3 quantized models highlights the ongoing efforts to compress LLMs for efficient deployment, with SINQ demonstrating a novel approach to quantization.
  • Emerging Research Areas: The research papers emphasize the growing interest in multi-view image editing, native vision-language models, and the application of LLMs in complex machine design.

AI Papers for 2026-03-09

RoboPocket: Improve Robot Policies Instantly with Your Phone

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using a single consumer smartphone. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
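The linear-probe baseline the abstract mentions is cheap to reproduce in spirit: fit a logistic-regression classifier on hidden activations and read off truthful-vs-false predictions. The sketch below uses synthetic activations as a stand-in for real LLM hidden states; all names and data here are illustrative, not from the paper's release.

```python
import numpy as np

# Synthetic stand-in: hidden states in which a "truthfulness direction"
# is linearly decodable, as a probe on real activations would assume.
rng = np.random.default_rng(0)
d = 32                                   # toy hidden size
direction = rng.normal(size=d)
X = rng.normal(size=(500, d))            # stand-in activations
y = (X @ direction > 0).astype(float)    # 1 = truthful, 0 = false (toy labels)

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))       # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step on log-loss

acc = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y.astype(bool)).mean()
print(round(float(acc), 2))
```

On real models the probe would be fit on a hidden-layer activation matrix instead of random features; the appeal, as the paper notes, is that such probes can be trained on unrelated data and still transfer.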

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

RealWonder: Real-Time Physical Action-Conditioned Video Generation

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available in our project website: https://liuwei283.github.io/RealWonder/

AI Models

Poralus/Poralus-Image-1357


---
language:
  - en
license: creativeml-openrail-m
tags:
  - stable-diffusion
  - stable-diffusion-diffusers
  - text-to-image
  - diffusers
  - fine-tuned
  - landscape
  - photography
pipeline_tag: text-to-image
base_model: runwayml/stable-diffusion-v1-5
datasets:
  - zh-plus/tiny-imagenet
  - laion/laion-coco
inference: true
---

Poralus-Image-1357

We are pleased to introduce Poralus-Image-1357, a fine-tuned text-to-image generation model built on top of Stable Diffusion v1.5. The model was developed and trained by Poralus with a focus on producing high-quality, atmospheric imagery with particular strength in natural environments, cinematic lighting, and compositional depth.

Training was conducted incrementally across multiple rounds, with each session building directly on the previous checkpoint rather than restarting from the base model. This approach preserves previously learned visual knowledge while progressively expanding the model's capabilities.


Key Characteristics

  • Atmospheric Natural Landscapes — The model demonstrates strong capability in rendering outdoor environments including mountains, forests, coastlines, and open terrain with realistic lighting and mood.
  • Cinematic Color Grading — Outputs consistently exhibit a distinctive color treatment, favoring warm golden tones, desaturated moody palettes, and dramatic pink-to-purple sky gradients.
  • Compositional Framing — The model has developed a tendency toward natural frame-within-frame compositions, using rock arches, foliage, and architectural openings to direct depth and focus.
  • Seasonal and Atmospheric Conditions — Fog, mist, golden hour, and overcast lighting are rendered with high fidelity and consistency.

Quick Start

Install the required library:

pip install diffusers transformers accelerate torch

Basic usage:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "Poralus/Poralus-Image-1357",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a misty forest path in autumn with golden leaves, cinematic lighting, atmospheric depth"

image = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]

image.save("output.png")

Recommended Settings

| Parameter | Recommended Value | Notes |
|---|---|---|
| num_inference_steps | 30 to 50 | Higher values produce sharper detail |
| guidance_scale | 7.0 to 9.0 | Higher values adhere more strictly to the prompt |
| Resolution | 512 x 512 | Native training resolution |
| Negative prompt | low quality, blurry, oversaturated, flat lighting | Improves output consistency |


Prompt Guidelines

The model responds well to descriptive prompts that include environmental context, lighting conditions, and atmospheric cues. Generic prompts tend to produce competent but uncharacteristic results; prompts that lean into the model's learned aesthetic produce the strongest outputs.

Recommended prompt structure:

[subject or scene], [setting and environment], [lighting condition], [mood or atmosphere], [quality descriptor]

Effective examples:

a calm mountain lake at dusk, soft reflections on still water, overcast sky, moody and cinematic

a misty forest path in autumn, golden and amber leaves on the ground, fog through the trees, natural diffused light

a dramatic coastal landscape at sunset, crashing waves on dark rocks, warm pink and orange sky, wide angle

a vast open plain at twilight, distant treeline, dramatic gradient sky from deep blue to warm pink

Sample Outputs

Prompt: a snowy mountain peak at dawn with pink clouds

The model interpreted this as a view through a natural rock arch framing a deep purple and pink sky, with distant mountains on the horizon. The composition demonstrates the model's tendency to introduce natural framing elements.


Prompt: a tropical beach with turquoise water and palm trees

Output rendered as a view through a low palm canopy arch toward a turquoise ocean with a white sand shoreline. Color accuracy for water was high. The arch framing motif appeared again, consistent with the model's learned compositional style.


Prompt: a misty forest path in autumn with golden leaves

Strong prompt adherence. Output featured a central vanishing-point path through tall trees with full autumn foliage, ground covered in fallen leaves, and soft fog filling the mid-ground. Color grading was warm and accurate. This category represents the model's strongest performance domain.


Prompt: a dramatic thunderstorm over an open field

The model rendered this as a desaturated black-and-white river landscape under a heavy overcast sky, capturing the atmospheric mood rather than the literal subject. Demonstrates the model's tendency to interpret dramatic atmospheric prompts through a landscape lens.


Prompt: a calm lake reflecting the milky way at night

Output produced a moody mountain lake scene with a teal-grey color palette, dramatic peak reflections, and heavy cloud cover. The literal prompt element (milky way) was not rendered; the model defaulted to its learned atmospheric treatment of night and water scenes.


Training Details

| Parameter | Value |
|---|---|
| Base Model | runwayml/stable-diffusion-v1-5 |
| Training Method | Full UNet fine-tune, multi-round continued training |
| Optimizer | AdamW 8-bit (bitsandbytes) |
| Learning Rate | 1e-5 to 5e-5, cosine annealing decay |
| Batch Size | 1 (effective batch size 4 via gradient accumulation) |
| Gradient Accumulation Steps | 4 |
| Mixed Precision | fp16 |
| Gradient Checkpointing | Enabled |
| Resolution | 512 x 512 |
| Hardware | NVIDIA Tesla T4 (16 GB VRAM) |
| Training Duration | Multiple 1-hour sessions, each continuing from prior checkpoint |

Training Data

Images were drawn from the following sources and stored locally prior to training:

  • zh-plus/tiny-imagenet — 200-class image dataset covering diverse object and scene categories
  • laion/laion-coco — Aesthetically filtered web images paired with descriptive captions
  • Supplementary curated pool — 600 images across 30 categories including natural landscapes, architecture, urban environments, portraits, food, and abstract subjects

Training round 2 shifted dataset emphasis toward human figures, indoor architecture, and urban scenes to address weaknesses identified during evaluation of round 1 outputs.


Known Limitations

  • Human figures — Faces and body proportions lack the fine detail present in models specifically trained for portrait generation. Human subjects in complex poses may render with anatomical inconsistencies.
  • Indoor and architectural interiors — Rooms, furniture, and constructed environments are rendered with less precision than outdoor scenes. Prompt adherence for interior subjects is lower than for landscapes.
  • Literal prompt fidelity — The model frequently substitutes learned aesthetic patterns for literal prompt elements (e.g., interpreting "thunderstorm" as a moody greyscale landscape rather than active storm imagery).
  • Text rendering — In-image text is not supported and will render as visual noise if requested.
  • Resolution — The model was trained at 512 x 512. Generating at higher resolutions without tiling will produce degraded results.

License

This model is released under the CreativeML Open RAIL-M License. Use of this model is subject to the terms of that license, including restrictions on harmful, deceptive, and non-consensual content generation.


Citation

This model is built on Stable Diffusion v1.5. If you use it in published work, please cite the original Latent Diffusion Models paper:

Author: Poralus

Likes: 4

Downloads: 0

Tags: diffusers, safetensors, stable-diffusion, stable-diffusion-diffusers, text-to-image, fine-tuned, landscape, photography, en, dataset:zh-plus/tiny-imagenet, dataset:laion/laion-coco, base_model:runwayml/stable-diffusion-v1-5, base_model:finetune:runwayml/stable-diffusion-v1-5, license:creativeml-openrail-m, endpoints_compatible, diffusers:StableDiffusionPipeline, region:us

neuralnets/sarvam-30b-4bit

Sarvam-30B 4-Bit (BitsAndBytes)

This repository provides a 4-bit NF4 quantized version of the base model sarvamai/sarvam-30b using bitsandbytes. The quantization significantly reduces GPU memory usage while preserving strong inference performance.

Base model sarvamai/sarvam-30b

Architecture SarvamMoEForCausalLM


Quantization Details

Quantization method: BitsAndBytes 4-bit (NF4)

Configuration used:

  • load_in_4bit = True
  • bnb_4bit_quant_type = nf4
  • bnb_4bit_compute_dtype = float16
  • bnb_4bit_use_double_quant = True

Approximate GPU memory usage:

| Model | GPU VRAM |
| ------------- | --------- |
| FP16 original | ~60 GB |
| 4-bit NF4 | ~16-18 GB |

This version is recommended for most users who want to run the model with reduced hardware requirements.


Installation

Install the required libraries.

pip install transformers accelerate bitsandbytes torch safetensors

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    trust_remote_code=True
)

Example Inference

prompt = "Explain mixture of experts in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Recommended GPUs:

  • A100 40GB or 80GB
  • RTX 4090
  • RTX 3090 (with offloading)

CPU RAM recommendation:

  • 32 GB or higher

Notes

  • This model uses bitsandbytes quantization integrated into Hugging Face Transformers.
  • The Sarvam architecture requires trust_remote_code=True.
  • Designed primarily for inference workloads.

Base Model

Original model:

sarvamai/sarvam-30b

Please refer to the base repository for model training details and benchmarks.


License

This repository distributes a quantized derivative of the original model.

Users must follow the license of the upstream model:

sarvamai/sarvam-30b

Author: neuralnets

Likes: 4

Downloads: 0

Tags: safetensors, sarvam_moe, custom_code, 4-bit, bitsandbytes, region:us

ArliAI/Qwen-3.5-27B-Derestricted

Author: ArliAI

Likes: 4

Downloads: 0

Tags: safetensors, qwen3_5, region:us

Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

⚠️ Sorry, the GGUF quantized version isn't available yet — my machine currently doesn't have enough storage to run the quantization. I've released an MLX version first in the meantime.

Author: Jackrong

Likes: 4

Downloads: 0

Tags: region:us

llmfan46/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic


---
language:
  - en
  - zh
license: apache-2.0
base_model:
  - Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
tags:
  - unsloth
  - qwen
  - qwen3.5
  - reasoning
  - chain-of-thought
  - Dense
  - heretic
  - uncensored
  - decensored
  - abliterated
pipeline_tag: text-generation
datasets:
  - nohurry/Opus-4.6-Reasoning-3000x-filtered
  - Jackrong/Qwen3.5-reasoning-700x
---

This is a decensored version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, made using Heretic v1.2.0 with Arbitrary-Rank Ablation (ARA).

Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| start_layer_index | 12 |
| end_layer_index | 44 |
| preserve_good_behavior_weight | 0.5198 |
| steer_bad_behavior_weight | 0.0011 |
| overcorrect_relative_weight | 0.5220 |
| neighbor_count | 10 |

Performance

| Metric | This model | Original model (Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled) |
| :----- | :--------: | :---------------------------: |
| KL divergence | 0.0092 | 0 (by definition) |
| Refusals | 21/100 | 98/100 |

A lower refusal rate indicates fewer content restrictions, while lower KL divergence indicates better preservation of the original model's capabilities. Higher refusal rates produce more rejections, objections, pushback, lecturing, censorship, softening, and deflection; higher KL divergence degrades coherence, reasoning ability, and overall quality.
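Arbitrary-Rank Ablation generalizes the classic rank-1 "abliteration" recipe, in which an estimated refusal direction v is projected out of a weight matrix so the model's outputs carry no component along v. The sketch below shows only the rank-1 case with random data; it is illustrative of the projection idea, not Heretic's actual code.

```python
import numpy as np

def ablate_direction(W, v):
    """Remove the component of W's output that lies along unit vector v:
    W' = (I - v v^T) W, so W' @ x has zero projection onto v for any x."""
    v = v / np.linalg.norm(v)
    return W - np.outer(v, v) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))     # stand-in weight matrix
v = rng.normal(size=8)          # stand-in "refusal direction"
W_ablated = ablate_direction(W, v)

# Any output of the ablated matrix is orthogonal to v.
x = rng.normal(size=8)
proj = (v / np.linalg.norm(v)) @ (W_ablated @ x)
print(abs(float(proj)) < 1e-9)  # True
```

The parameters in the table above (layer range, behavior weights, neighbor count) suggest Heretic tunes where and how strongly such projections are applied, trading refusal reduction against KL divergence from the original model.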

GGUF Version

GGUF quantizations are available at llmfan46/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-GGUF.


🌟 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

📢 Release Note Build Environment Upgrades:

  • Fine-tuning Framework: Unsloth 2026.3.3
  • Core Dependencies: Transformers 5.2.0
  • This model fixes the crash in the official model caused by the Jinja template not supporting the "developer" role (commonly sent by modern coding agents such as Claude Code and OpenCode).
  • It does not disable thinking mode by default, allowing the agent to run continuously for over 9 minutes without interruption.
  • Compared to the original model, autonomy and stability are significantly improved.


💡 Model Introduction

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a highly capable reasoning model fine-tuned on top of the powerful Qwen3.5 architecture. The model's core directive is to leverage state-of-the-art Chain-of-Thought (CoT) distillation primarily sourced from Claude-4.6 Opus interactions.

Through Supervised Fine-Tuning (SFT) focusing specifically on structured reasoning logic, this model excels in breaking down complex user problems, planning step-by-step methodologies within strictly formatted <think> tags, and ultimately delivering precise, nuanced solutions.

🧠 Example of a Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern:
“Let me analyze this request carefully: 1..2..3...”.
This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
            ...

🗺️ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
 │
 ▼
Final Model (Claude-4.6-Opus-Reasoning-Distilled, text-only)

📋 Stage Details

🔥Community-tested advantages (benchmark tests by user @sudoingX on a single RTX 3090):

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled shows significant advantages in coding-agent environments such as Claude Code and OpenCode:

  • Native support for the “developer” role, requiring no Jinja template patches or ChatML workarounds.
  • Thinking mode fully preserved (logs confirm thinking=1), not silently disabled, maintaining the complete chain-of-thought reasoning process.
  • Greatly improved autonomy and stability: capable of running autonomously for over 9 minutes with zero human intervention. It actively waits for tool responses, reads outputs, self-corrects errors, and can even automatically generate a README, whereas the base model often stalls or freezes mid-execution.

Hardware usage remains unchanged:

  • About 16.5 GB VRAM with Q4_K_M quantization
  • 29–35 tok/s generation speed
  • Full 262K context with no compromises

These improvements come from successfully distilling the structured reasoning style of Claude 4.6 Opus, allowing Qwopus to be truly plug-and-play in modern local coding agents and deliver an experience close to Opus in smoothness and usability.
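As a rough sanity check on the reported 16.5 GB figure, the weights-only footprint can be estimated from bits per weight. This sketch assumes 27B parameters and ~4.8 bits per weight as a typical Q4_K_M average (both are assumptions, not numbers from the card); KV cache and runtime buffers account for the remainder:

```python
# Back-of-envelope VRAM estimate for a Q4_K_M quantized 27B model.
# Assumed values: 27e9 parameters, ~4.8 bits/weight average for Q4_K_M.
params = 27e9
bits_per_weight = 4.8                      # assumed Q4_K_M average
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weight_gb:.1f} GB")  # KV cache and buffers add the rest
```

The gap between this weights-only estimate and the observed ~16.5 GB is consistent with context-dependent KV-cache and buffer overhead.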

Thanks to the community for the in-depth testing and feedback!

🔹 Supervised Fine-Tuning (SFT)

  • Objective: To inject high-density reasoning logic and establish a strict format for problem-solving involving an internal thinking state prior to outputting the final response.
  • Methodology: We utilized Unsloth for highly efficient memory and compute optimization. A critical component of this stage is the train_on_responses_only strategy, masking instructions so the loss is purely calculated over the generation of the <think> sequences and the subsequent solutions.
  • Format Enforcement: All training samples were systematically normalized so the model strictly abides by the structure <think> {internal reasoning} </think>\n {final answer}.
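The enforced structure lends itself to mechanical validation. A minimal sketch, assuming nothing about the actual training code (the `THINK_FORMAT` pattern and `split_reasoning` helper are hypothetical names introduced here for illustration):

```python
import re

# Hypothetical helper: checks that a sample follows the enforced
# <think> {internal reasoning} </think>\n {final answer} layout.
THINK_FORMAT = re.compile(r"^<think>\n?(.*?)\n?</think>\n(.+)$", re.DOTALL)

def split_reasoning(text: str):
    m = THINK_FORMAT.match(text.strip())
    if m is None:
        return None  # sample violates the enforced format
    reasoning, answer = m.group(1), m.group(2)
    return reasoning.strip(), answer.strip()

sample = "<think>\n1. Identify the objective.\n</think>\nThe answer is 42."
print(split_reasoning(sample))
```

A check like this could be run over every training sample to guarantee the strict format described above, and over generations to confirm the thinking block survives fine-tuning.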

📚 All Datasets Used

The dataset consists of high-quality, filtered reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Provides comprehensive Claude 4.6 Opus reasoning trajectories. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injects high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples designed to strengthen structured step-by-step problem solving and improve reasoning diversity. |

🌟 Core Skills & Capabilities

  1. Modular & Structured Thinking: Inheriting traits from Opus-level reasoning, the model parses prompts confidently and lays out a sequential plan in its <think> block, rather than falling into exploratory, trial-and-error self-doubt.

⚠️ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; factual claims produced during the thinking sequence may contain hallucinations, especially when they concern real-world events the model cannot verify.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • Preview Version Notice: This model is relatively new and intentionally lightweight, so the surrounding ecosystem (inference templates, fine-tuning pipelines, routing configurations, and tooling integrations) may not yet be fully mature or standardized. Users may encounter occasional bugs, compatibility inconsistencies, or integration edge cases; treat the current release as a preview build while the broader architectural stack and supporting utilities continue to stabilize.

🙏 Acknowledgements

Significant thanks to the Unsloth AI team for making rapid fine-tuning of MoE and large LLM models accessible. We also acknowledge the Qwen team and the open-source community developers producing exceptional distilled datasets (nohurry and TeichAI).

📖 Citation

If you use this model in your research or projects, please cite:

@misc{jackrong_qwen35_opus_distilled,
  title        = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled}}
}

Author: llmfan46

Likes: 3

Downloads: 0

Tags: safetensors, qwen3_5, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, Dense, heretic, uncensored, decensored, abliterated, text-generation, conversational, en, zh, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, base_model:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, base_model:finetune:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, license:apache-2.0, region:us

Naphula/Bella-Bartender-8B-v1-GGUF

Author: Naphula

Likes: 2

Downloads: 0

Tags: region:us

hank87/ltxdreamlyyy

Author: hank87

Likes: 2

Downloads: 0

Tags: region:us

neuralnets/sarvam-30b-8bit

Sarvam-30B 8-Bit (BitsAndBytes)

This repository provides an 8-bit quantized version of the base model sarvamai/sarvam-30b using bitsandbytes.

8-bit quantization reduces memory usage while maintaining very high model quality.

Base model: sarvamai/sarvam-30b

Architecture: SarvamMoEForCausalLM


Quantization Details

Quantization method: BitsAndBytes 8-bit

Configuration used:

  • load_in_8bit = True

Approximate GPU memory usage:

| Model         | GPU VRAM |
|---------------|----------|
| FP16 original | ~60 GB   |
| 8-bit         | ~30 GB   |

This version provides near-FP16 quality while using roughly half the memory.
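The halving in the table follows directly from bytes per parameter: FP16 stores 2 bytes per weight, while 8-bit stores 1. A quick sketch (a weights-only estimate; real usage also depends on layers kept unquantized and on activation buffers):

```python
# Weights-only memory estimate for a 30B-parameter model.
# FP16 = 2 bytes/param, 8-bit = 1 byte/param.
params = 30e9
fp16_gb = params * 2 / 1e9   # ~60 GB, matching the table
int8_gb = params * 1 / 1e9   # ~30 GB, matching the table
print(fp16_gb, int8_gb)
```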


Installation

Install dependencies.

pip install transformers accelerate bitsandbytes torch safetensors

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint is already 8-bit quantized, so it loads directly
# without extra quantization arguments.
model = AutoModelForCausalLM.from_pretrained(
    "neuralnets/sarvam-30b-8bit",
    device_map="auto",       # place layers automatically across available GPUs
    trust_remote_code=True   # required for the custom SarvamMoEForCausalLM architecture
)

tokenizer = AutoTokenizer.from_pretrained(
    "neuralnets/sarvam-30b-8bit",
    trust_remote_code=True
)

Example Inference

prompt = "Explain mixture of experts in simple terms."

# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Cap the response at 200 newly generated tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=200
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Recommended GPUs:

  • A100 40GB or 80GB
  • RTX 4090
  • RTX 3090

CPU RAM recommendation:

  • 32 GB or more

Notes

  • Uses bitsandbytes 8-bit quantization integrated with Hugging Face Transformers.
  • Requires trust_remote_code=True due to the Sarvam architecture.
  • Suitable for high-quality inference.

Base Model

Original model repository:

sarvamai/sarvam-30b

Refer to the base model page for detailed information about training and architecture.


License

This repository distributes a quantized derivative of the upstream model.

Users must comply with the license of the original model:

sarvamai/sarvam-30b

Author: neuralnets

Likes: 2

Downloads: 0

Tags: safetensors, sarvam_moe, custom_code, 8-bit, bitsandbytes, region:us

mradermacher/Qwen3.5-27B-Esper3.1-i1-GGUF


base_model: ValiantLabs/Qwen3.5-27B-Esper3.1
datasets:
  • sequelbox/Titanium3-DeepSeek-V3.1-Terminus
  • sequelbox/Tachibana3-Part1-DeepSeek-V3.1-Terminus
  • sequelbox/Tachibana3-Part2-DeepSeek-V3.2
  • sequelbox/Mitakihara-DeepSeek-R1-0528
language: en
library_name: transformers
license: apache-2.0
quantized_by: mradermacher
tags: esper, esper-3.1, esper-3, valiant, valiant-labs, qwen, qwen-3.5, qwen-3.5-27b, 27b, reasoning, code, code-instruct, python, javascript, dev-ops, jenkins, terraform, ansible, docker, kubernetes, helm, grafana, prometheus, shell, bash, azure, aws, gcp, cloud, scripting, powershell, problem-solving, architect, engineer, developer, creative, analytical, expert, rationality, conversational, chat, instruct
About


weighted/imatrix quants of https://huggingface.co/ValiantLabs/Qwen3.5-27B-Esper3.1


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/Qwen3.5-27B-Esper3.1-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.
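For `.partXofY` splits of this kind, the parts are plain byte splits, so they can be joined with `cat`. A self-contained sketch using dummy files in place of a real multi-part GGUF (the filenames are illustrative, not actual files from this repository):

```shell
# Dummy stand-ins for a real two-part download such as
# model.gguf.part1of2 / model.gguf.part2of2.
printf 'AAAA' > model.gguf.part1of2
printf 'BBBB' > model.gguf.part2of2

# Concatenating the parts in order restores the single GGUF file.
cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf
cat model.gguf
```

The parts must be listed in ascending order; the joined file is then loaded like any single-file GGUF.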

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 6.3 | for the desperate |
| GGUF | i1-IQ1_M | 6.9 | mostly desperate |
| GGUF | i1-IQ2_XXS | 7.8 | |
| GGUF | i1-IQ2_XS | 8.5 | |
| GGUF | i1-IQ2_S | 8.8 | |
| GGUF | i1-IQ2_M | 9.5 | |
| GGUF | i1-Q2_K_S | 9.8 | very low quality |
| GGUF | i1-Q2_K | 10.2 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 10.8 | lower quality |
| GGUF | i1-IQ3_XS | 11.7 | |
| GGUF | i1-Q3_K_S | 12.2 | IQ3_XS probably better |
| GGUF | i1-IQ3_S | 12.2 | beats Q3_K* |
| GGUF | i1-IQ3_M | 12.7 | |
| GGUF | i1-Q3_K_M | 13.4 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 14.1 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 14.8 | |
| GGUF | i1-Q4_0 | 15.6 | fast, low quality |
| GGUF | i1-Q4_K_S | 15.7 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 16.6 | fast, recommended |
| GGUF | i1-Q4_1 | 17.2 | |
| GGUF | i1-Q5_K_S | 18.8 | |
| GGUF | i1-Q5_K_M | 19.5 | |
| GGUF | i1-Q6_K | 22.2 | practically like static Q6_K |
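One practical use of the size column is choosing the largest quant that fits a VRAM budget. A minimal sketch (the 2 GB headroom for KV cache and buffers is an assumed rule of thumb, and the sizes are file sizes from the table above, not runtime footprints):

```python
# A subset of the table's (quant type -> file size in GB) entries.
quants = {
    "i1-IQ2_M": 9.5, "i1-Q3_K_M": 13.4, "i1-IQ4_XS": 14.8,
    "i1-Q4_K_S": 15.7, "i1-Q4_K_M": 16.6, "i1-Q5_K_M": 19.5,
    "i1-Q6_K": 22.2,
}

def pick_quant(budget_gb: float, headroom_gb: float = 2.0):
    """Pick the largest quant whose file fits the budget minus headroom."""
    usable = budget_gb - headroom_gb  # reserve room for KV cache and buffers
    fitting = {k: v for k, v in quants.items() if v <= usable}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(24.0))  # e.g. a 24 GB RTX 3090/4090
```

Larger files are not always higher quality (see the notes column, where some IQ quants beat larger K quants), so the table's notes should override a pure size-based pick.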

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, esper, esper-3.1, esper-3, valiant, valiant-labs, qwen, qwen-3.5, qwen-3.5-27b, 27b, reasoning, code, code-instruct, python, javascript, dev-ops, jenkins, terraform, ansible, docker, kubernetes, helm, grafana, prometheus, shell, bash, azure, aws, gcp, cloud, scripting, powershell, problem-solving, architect, engineer, developer, creative, analytical, expert, rationality, conversational, chat, instruct, en, dataset:sequelbox/Titanium3-DeepSeek-V3.1-Terminus, dataset:sequelbox/Tachibana3-Part1-DeepSeek-V3.1-Terminus, dataset:sequelbox/Tachibana3-Part2-DeepSeek-V3.2, dataset:sequelbox/Mitakihara-DeepSeek-R1-0528, base_model:ValiantLabs/Qwen3.5-27B-Esper3.1, base_model:quantized:ValiantLabs/Qwen3.5-27B-Esper3.1, license:apache-2.0, endpoints_compatible, region:us, imatrix

michaelw9999/Qwen3.5-27B-NVFP4-GGUF

Author: michaelw9999

Likes: 2

Downloads: 0

Tags: gguf, endpoints_compatible, region:us, imatrix, conversational