Today's AI Summary

AI Developments: Long-Context Modeling, Code Alignment, and More

Here's a look at the latest advancements in AI, covering new research papers and models released today.

Research Highlights

  • Artificial Hippocampus Networks for Efficient Long-Context Modeling: A new paper introduces Artificial Hippocampus Networks (AHN), inspired by cognitive science, to improve long-sequence modeling. AHNs augment models like Qwen2.5-3B-Instruct, reducing computational and memory requirements while enhancing performance on long-context benchmarks.
  • Aligning Code Evaluation with Human Preference: Researchers introduce VeriCode, a taxonomy of verifiable code instructions, and Vibe Checker, a testbed to assess code instruction following and functional correctness. The study reveals that instruction following is crucial for aligning with human preferences in coding.
  • 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations: A novel neural surrogate called GyroSwin is introduced for modeling 5D nonlinear gyrokinetic simulations. GyroSwin extends Vision Transformers to 5D and reduces the cost of fully resolved nonlinear gyrokinetics significantly.
  • Bootstrapping LLMs for Long-Horizon Reasoning: A scalable method is presented to enhance long-horizon reasoning capabilities of LLMs using only existing short-horizon data. The approach involves synthetically composing simple problems into complex dependency chains and training models with outcome-only rewards.
  • Scaling MLE Tasks with Automated Multi-Agent Pipeline: MLE-Smith, a fully automated multi-agent pipeline, is introduced to transform raw datasets into competition-style MLE challenges. This pipeline aims to scale MLE tasks with verifiable quality and real-world usability.

Model Releases

  • YanoljaNEXT-Rosetta-12B-2510: Yanolja has released a fine-tuned version of the Gemma3 architecture, designed for translating structured data (JSON format) across 30+ languages. It achieves a chrF++ score of 37.36 on English-to-Korean translation, outperforming models like GPT-4o and Gemini 2.5 Flash.
  • Qwen-Image-Edit Enhanced Model: Eddy has enhanced Phr00t's Qwen-Rapid-AIO-v3 model, improving facial processing and semantic instruction adherence. The enhanced model maintains compatibility with existing systems and workflows.
  • Noobai11-CLIP-L-and-BigG-Anime-Text-Encoders: Anzhc has released a model based on diffusers that uses CLIP-ViT and noobai-XL-1.0.
  • NikolayKozloff/UserLM-8b-Q8_0-GGUF: NikolayKozloff has released a model converted to GGUF format from microsoft/UserLM-8b.
  • EugeneEvstafev/llm7-graph-270m-it-ft-20251009: EugeneEvstafev has released a model that has been fine-tuned to extract and generate knowledge graphs from unstructured text.
  • violeteverisland/Neko-Chat: violeteverisland has released a Chinese lightweight dialogue model based on DeepSeek-R1-Distill-Qwen-1.5B.
  • OPPOer/Qwen-Image-Edit-2509-Pruning: OPPOer has released a model based on Qwen-Image-Edit and has attempted model pruning, resulting in a model size of 13.6B parameters.
  • NikolayKozloff/YanoljaNEXT-Rosetta-12B-2510-Q4_K_M-GGUF: NikolayKozloff has released a model converted to GGUF format from yanolja/YanoljaNEXT-Rosetta-12B-2510.
  • NikolayKozloff/YanoljaNEXT-Rosetta-12B-2510-Q5_K_S-GGUF: NikolayKozloff has released a model converted to GGUF format from yanolja/YanoljaNEXT-Rosetta-12B-2510.

Key Takeaways

  • Long-Context Efficiency: Research is focusing on improving the efficiency of long-context models, balancing memory compression and fidelity.
  • Human-Aligned Evaluation: New evaluation methods are emerging to better align AI models with human preferences, particularly in code generation.
  • Multimodal Advancements: Models like Qwen-Image-Edit are being enhanced to improve specific capabilities like facial processing and instruction following.

AI Papers for 2026-03-13

COMIC: Agentic Sketch Comedy Generation

We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

LiTo: Surface Light Field Tokenization

We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab-lab-princeton.github.io/nefty/
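
The discretize-then-optimize idea can be sketched outside the paper's setting. Below is a minimal PyTorch toy, assuming a 1D rod, an explicit finite-difference heat solver, and a learnable per-node diffusivity vector standing in for NeFTY's continuous neural field; it only illustrates how autograd through the solver recovers a subsurface "defect" from transient surface temperatures, and is not the authors' implementation.

import torch
import torch.nn.functional as F

n, steps, dt, dx = 64, 500, 0.5, 1.0
true_alpha = torch.full((n,), 0.5)
true_alpha[8:16] = 0.1            # subsurface "defect": locally reduced diffusivity

def simulate(alpha):
    # Explicit finite-difference heat solve; returns the near-surface temperature trace.
    u = torch.zeros(n)
    trace = []
    for _ in range(steps):
        u = torch.cat([torch.ones(1), u[1:]])                        # boundary held at T = 1
        lap = F.pad(u[2:] - 2 * u[1:-1] + u[:-2], (1, 1)) / dx**2    # discrete Laplacian
        u = u + dt * alpha * lap                                     # PDE update as a hard constraint
        trace.append(u[1])
    return torch.stack(trace)

measured = simulate(true_alpha)   # synthetic transient "measurement"

alpha = torch.full((n,), 0.5, requires_grad=True)
opt = torch.optim.Adam([alpha], lr=1e-2)
for _ in range(200):
    loss = torch.mean((simulate(alpha) - measured) ** 2)
    opt.zero_grad()
    loss.backward()               # gradients flow through every solver step (discretize-then-optimize)
    opt.step()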

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
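
A minimal sketch of the event-curve idea described above, assuming per-frame or per-clip embeddings are already available from a pretrained encoder; the curve just measures how much the representation changes from one step to the next, so the same definition applies to video and to music independently.

import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    # Intra-modal temporal-change curve: 1 - cosine similarity of consecutive embeddings.
    # embeddings has shape (T, D); high values mark strong change ("events").
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.sum(normed[1:] * normed[:-1], axis=1)
    return 1.0 - cos

# Usage sketch: fine-tune the text-to-music model conditioned on music-event curves,
# then at inference substitute the video-event curve computed the same way.
video_embeddings = np.random.randn(120, 512)   # stand-in for real encoder outputs
curve = event_curve(video_embeddings)          # shape (119,), one value per transition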

Instruction set for the representation of graphs

We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
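
The similarity-search application can be sketched directly: because Levenshtein distance between IsalGraph strings tracks graph edit distance, nearest-neighbour search over the strings is a cheap GED proxy. The instruction alphabet and the GraphToString encoder are not reproduced here, so the encoded strings below are hypothetical placeholders.

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance between two instruction strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical database of canonical IsalGraph strings produced elsewhere by GraphToString.
database = {"g1": "ABBACCD", "g2": "ABBACDD", "g3": "DDCCBBA"}
query = "ABBACCE"
nearest = min(database, key=lambda name: levenshtein(query, database[name]))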

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts were judged by art historians to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well-established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.

Artificial Intelligence as a Catalyst for Innovation in Software Engineering

The rapid evolution and inherent complexity of modern software requirements demand highly flexible and responsive development methodologies. While Agile frameworks have become the industry standard for prioritizing iteration, collaboration, and adaptability, software development teams continue to face persistent challenges in managing constantly evolving requirements and maintaining product quality under tight deadlines. This article explores the intersection of Artificial Intelligence (AI) and Software Engineering (SE) to analyze how AI serves as a powerful catalyst for enhancing agility and fostering innovation. The research combines a comprehensive review of existing literature with an empirical study, utilizing a survey directed at Software Engineering professionals to assess the perception, adoption, and impact of AI-driven tools. Key findings reveal that the integration of AI, specifically through Machine Learning (ML) and Natural Language Processing (NLP), facilitates the automation of tedious tasks, from requirement management to code generation and testing. This paper demonstrates that AI not only optimizes current Agile practices but also introduces new capabilities essential for sustaining quality, speed, and innovation in the future landscape of software development.

GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2--7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
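
A minimal sketch of the prompt-based augmentation strategy, assuming detections come from an external object detector; run_detector is a hypothetical stand-in for a YOLO-style wrapper, and the prompt template is illustrative rather than the paper's exact format. Following the ablation above, detector confidences are kept out of the prompt while box positions are included.

from collections import Counter

def run_detector(image_path: str):
    # Hypothetical ODM wrapper; in practice this would call a YOLO-style model and
    # return (label, confidence, (x1, y1, x2, y2)) tuples.
    return [("car", 0.91, (12, 40, 180, 210)),
            ("car", 0.88, (200, 35, 350, 220)),
            ("person", 0.77, (90, 60, 130, 190))]

def grounded_prompt(image_path: str, question: str) -> str:
    detections = run_detector(image_path)
    counts = Counter(label for label, _, _ in detections)
    # Explicit symbolic grounding: per-class counts plus box positions, injected as text.
    lines = [f"- {label} at {box}" for label, _, box in detections]
    grounding = (f"Detected objects ({sum(counts.values())} total, {dict(counts)} by class):\n"
                 + "\n".join(lines))
    return f"{grounding}\n\nQuestion: {question}"

prompt = grounded_prompt("street.jpg", "How many cars are in the image?")
# prompt is then passed to the VLM together with the original image.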

Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

Deep Reinforcement Learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents the contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to provide an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. The project page is https://contact-coverage-guided-exploration.github.io.
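
A minimal sketch of the count-based contact-coverage reward, assuming the contact state has already been discretized into (object-state hash, finger, object-region) triples; the 1/sqrt(count) bonus is a common count-based exploration form, not necessarily the paper's exact formula.

import math
from collections import defaultdict

class ContactCoverageReward:
    # Counts how often each finger has touched each object region, conditioned on a
    # discretized (hashed) object state, and rewards rarer contact patterns more.
    def __init__(self):
        self.counts = defaultdict(int)   # key: (state_hash, finger_id, region_id)

    def __call__(self, state_hash, contacts):
        reward = 0.0
        for finger_id, region_id in contacts:                 # current finger-region intersections
            key = (state_hash, finger_id, region_id)
            self.counts[key] += 1
            reward += 1.0 / math.sqrt(self.counts[key])       # novel contact -> larger bonus
        return reward

# Usage sketch inside an RL step: add the coverage bonus to the task reward.
coverage = ContactCoverageReward()
bonus = coverage(state_hash=hash((3, 7, 1)) % 10_000, contacts={(0, 12), (1, 12), (3, 40)})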

AI Models

Tesslate/OmniCoder-9B


library_name: transformers
base_model: Qwen/Qwen3.5-9B
tags:
  - qwen3.5
  - code
  - agent
  - sft
  - omnicoder
  - tesslate
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
model-index:
  - name: OmniCoder-9B
    results:
      - task:
          type: text-generation
        dataset:
          name: AIME 2025
          type: custom
        metrics:
          - name: pass@5
            type: accuracy
            value: 90.0
      - task:
          type: text-generation
        dataset:
          name: GPQA Diamond
          type: custom
        metrics:
          - name: pass@1
            type: accuracy
            value: 83.8
          - name: pass@3
            type: accuracy
            value: 86.4
      - task:
          type: text-generation
        dataset:
          name: Terminal-Bench 2.0
          type: custom
        metrics:
          - name: Pass Rate
            type: accuracy
            value: 28.1


OmniCoder-9B

A 9B coding agent fine-tuned on 425K agentic trajectories.


Overview

OmniCoder-9B is a 9-billion parameter coding agent model built by Tesslate, fine-tuned on top of Qwen3.5-9B's hybrid architecture (Gated Delta Networks interleaved with standard attention). It was trained on 425,000+ curated agentic coding trajectories spanning real-world software engineering tasks, tool use, terminal operations, and multi-step reasoning.

The training data was specifically built from Claude Opus 4.6 agentic and coding reasoning traces, targeting scaffolding patterns from Claude Code, OpenCode, Codex, and Droid. The dataset includes successful trajectories from models like Claude Opus 4.6, GPT-5.4, GPT-5.3-Codex, and Gemini 3.1 Pro.

The model shows strong agentic behavior: it recovers from errors (read-before-write), responds to LSP diagnostics, and uses proper edit diffs instead of full rewrites. These patterns were learned directly from the real-world agent trajectories it was trained on.

Key Features

  • Trained on Frontier Agent Traces : Built from Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, and Gemini 3.1 Pro agentic coding trajectories across Claude Code, OpenCode, Codex, and Droid scaffolding
  • Hybrid Architecture : Inherits Qwen3.5's Gated Delta Networks interleaved with standard attention for efficient long-context processing
  • 262K Native Context : Full 262,144 token context window, extensible to 1M+
  • Error Recovery : Learns read-before-write patterns, responds to LSP diagnostics, and applies minimal edit diffs instead of full rewrites
  • Thinking Mode : Supports <think>...</think> reasoning chains for complex problem decomposition
  • Apache 2.0 : Fully open weights, no restrictions

Benchmarks

| Benchmark | OmniCoder-9B | Qwen3.5-9B | Qwen3-Next-80B | GPT-OSS-120B | GPT-OSS-20B | GLM-4.7-Flash | GLM 4.7 | Claude Haiku 4.5 |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| AIME 2025 (pass@5) | 90 | | | | 91.7 | 91.6 | | |
| GPQA Diamond (pass@1) | 83.8 | 81.7 | 77.2 | 80.1 | 71.5 | | | 73 |
| GPQA Diamond (pass@3) | 86.4 | | | | | | | |
| Terminal-Bench 2.0 | 23.6 | 14.6 | | | | | 33.4 | 27 |
  • GPQA Diamond pass@1: 83.8% (166/198). +2.1 points over the Qwen3.5-9B base model (81.7). At pass@3: 86.4 (171/198).
  • AIME 2025 pass@5: 90% (27/30).
  • Terminal-Bench 2.0: 23.6% (21/89). +8.99 points (+61% improvement) over the Qwen3.5-9B base model (14.6%, 13/89).

Quickstart

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tesslate/OmniCoder-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to find the longest common subsequence of two strings."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, top_k=20)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
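
If the response uses the thinking mode described above, the reasoning can be split from the final answer. A minimal sketch, assuming the <think>...</think> tags appear verbatim in the decoded text:

import re

decoded = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
match = re.search(r"<think>(.*?)</think>", decoded, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", decoded, flags=re.DOTALL).strip()
print(answer)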

vLLM

vllm serve Tesslate/OmniCoder-9B --tensor-parallel-size 1 --max-model-len 65536

Then query the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="Tesslate/OmniCoder-9B",
    messages=[{"role": "user", "content": "Explain the difference between a mutex and a semaphore."}],
    temperature=0.6,
)
print(response.choices[0].message.content)

llama.cpp (GGUF)

llama-cli --hf-repo Tesslate/OmniCoder-9B-GGUF --hf-file omnicoder-9b-q4_k_m.gguf -p "Your prompt" -c 8192

All quantizations: Tesslate/OmniCoder-9B-GGUF


Training Details

| | |
|:---|:---|
| Base Model | Qwen3.5-9B |
| Method | LoRA SFT (r=64, alpha=32) |
| Dataset | 425K agentic trajectories from 5 sources |
| Packing | Sample packing with 99.35% efficiency |
| Hardware | 4x NVIDIA H200 (DDP) |
| Framework | Axolotl |
| Precision | bf16 |
| Optimizer | AdamW (lr=2e-4, cosine schedule) |


Architecture

OmniCoder inherits Qwen3.5-9B's hybrid architecture:

  • Gated Delta Networks : Linear attention layers interleaved with standard attention for efficient long-range dependencies
  • VLM Backbone : Built on Qwen3_5ForConditionalGeneration

Recommended Sampling Parameters

| Parameter | Value |
|:---|:---|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Presence Penalty | 0.0 |

For agentic / tool-calling tasks, consider lower temperature (0.2-0.4) for more deterministic behavior.


Limitations

  • Performance on non-English tasks has not been extensively evaluated
  • Tool-calling format is flexible but works best with the scaffolding patterns seen in training

Acknowledgments

Special thanks to the Axolotl team and the discussion in axolotl#3453 for helping get Qwen3.5 packing support working.


Citation

@misc{omnicoder2025,
  title={OmniCoder-9B: A Frontier Open Coding Agent},
  author={Tesslate},
  year={2025},
  url={https://huggingface.co/Tesslate/OmniCoder-9B}
}

Built by Tesslate

Author: Tesslate

Likes: 45

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, qwen3.5, code, agent, sft, omnicoder, tesslate, text-generation, conversational, en, base_model:Qwen/Qwen3.5-9B, base_model:finetune:Qwen/Qwen3.5-9B, license:apache-2.0, model-index, endpoints_compatible, region:us

Tesslate/OmniCoder-9B-GGUF


base_model: Tesslate/OmniCoder-9B
tags:
  - llama-cpp
  - gguf
  - qwen3.5
  - omnicoder
  - tesslate
  - code
  - agent
license: apache-2.0


OmniCoder-9B-GGUF

GGUF quantizations of OmniCoder-9B


Available Quantizations

| Quantization | Size | Use Case |
|:---|---:|:---|
| Q2_K | ~3.8 GB | Extreme compression, lowest quality |
| Q3_K_S | ~4.3 GB | Small footprint |
| Q3_K_M | ~4.6 GB | Small footprint, balanced |
| Q3_K_L | ~4.9 GB | Small footprint, higher quality |
| Q4_0 | ~5.3 GB | Good balance |
| Q4_K_S | ~5.4 GB | Good balance |
| Q4_K_M | ~5.7 GB | Recommended for most users |
| Q5_0 | ~6.3 GB | High quality |
| Q5_K_S | ~6.3 GB | High quality |
| Q5_K_M | ~6.5 GB | High quality, balanced |
| Q6_K | ~7.4 GB | Near-lossless |
| Q8_0 | ~9.5 GB | Highest quality quantization |
| BF16 | ~17.9 GB | Full precision |

Usage

# Install llama.cpp
brew install llama.cpp  # macOS
# or build from source: https://github.com/ggml-org/llama.cpp

# Interactive chat
llama-cli --hf-repo Tesslate/OmniCoder-9B-GGUF --hf-file omnicoder-9b-q4_k_m.gguf -p "Your prompt" -c 8192

# Server mode (OpenAI-compatible API)
llama-server --hf-repo Tesslate/OmniCoder-9B-GGUF --hf-file omnicoder-9b-q4_k_m.gguf -c 8192
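
The same GGUF files can also be used from Python via llama-cpp-python; a minimal sketch, assuming the package is installed (pip install llama-cpp-python) and the repo and file names stay as published:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Tesslate/OmniCoder-9B-GGUF",
    filename="omnicoder-9b-q4_k_m.gguf",
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search function in Python."}],
    temperature=0.6,
)
print(out["choices"][0]["message"]["content"])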

Built by Tesslate | See full model card: OmniCoder-9B

Author: Tesslate

Likes: 23

Downloads: 0

Tags: gguf, llama-cpp, qwen3.5, omnicoder, tesslate, code, agent, base_model:Tesslate/OmniCoder-9B, base_model:quantized:Tesslate/OmniCoder-9B, license:apache-2.0, endpoints_compatible, region:us, conversational

AesSedai/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF


base_model:
  - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Updates

03/12/2026

I uploaded the wrong splits for Q4_K_M / Q5_K_M and have corrected that now with the changes mentioned in the 03/11 update. Also added an IQ3_S quant now that there is a PR from @bartowski to fix the IQ4_NL quantization crash.

03/11/2026

I've adjusted the Q4_K_M and Q5_K_M to use Q5_0 for the ffn_down_exps tensor, which brings the Q5_K_M quant size down substantially.

Description

This repo contains specialized MoE quants for NVIDIA-Nemotron-3-Super-120B-A12B-BF16. The idea is that, given the huge size of the FFN tensors compared to the rest of the tensors in the model, it should be possible to achieve better quality while keeping the overall model size smaller than a comparable naive quantization. To that end, the default quantization type is kept at high quality, and the FFN up and FFN gate tensors are quantized more aggressively along with the FFN down tensors.
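
To make the trade-off concrete, here is a small illustrative calculation; the parameter counts are invented for illustration and are not the actual Nemotron tensor shapes.

GIB = 1024**3

def size_gib(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / GIB

attn_and_misc = 12e9   # hypothetical: attention, norms, embeddings, dense layers
ffn_experts = 108e9    # hypothetical: the bulk of the MoE FFN expert weights

naive = size_gib(attn_and_misc + ffn_experts, 4.5)
mixed = size_gib(attn_and_misc, 8.5) + size_gib(ffn_experts, 4.5)
print(f"~4.5 bpw everywhere:         {naive:.1f} GiB")
print(f"Q8_0-ish attn + 4.5 bpw FFN: {mixed:.1f} GiB")
# Keeping the small tensors near-lossless adds only a few GiB, because the FFN
# expert tensors dominate the parameter count.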

Notes

This model is a little weird, architecturally. There isn't a ffn_gate_exps tensor in it, and the ffn_down_exps tensor has 2688 elements, which means it is not compatible with most Q*_K quantizations.

So you may notice that the ffn_down_exps here is a little odd, and producing an actual IQ3_S-sized quant like I normally do is tricky since the IQ4_NL quantization type is also not behaving well.

I've chosen to upload these 3 quants for now and hope that there will be some improvements soon.

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| :----- | :--- | :--- | :--- | :--- | :--- |
| Q5_K_M | 80.27 GiB (5.71 BPW) | Q8_0 / Q5_K / X / Q5_0 | 4.590127 ± 0.027865 | +0.0817% | 0.007533 ± 0.000042 |
| Q4_K_M | 73.70 GiB (5.25 BPW) | Q8_0 / Q4_K / X / Q5_0 | 4.600659 ± 0.027947 | +0.3113% | 0.010532 ± 0.000072 |
| IQ4_XS | 63.45 GiB (4.52 BPW) | Q8_0 / IQ3_S / X / Q4_1 | 4.647848 ± 0.028308 | +1.3402% | 0.022996 ± 0.000191 |
| IQ3_S | 52.66 GiB (3.75 BPW) | Q6_K / IQ2_S / X / IQ4_NL | 4.787999 ± 0.029268 | +4.3960% | 0.059260 ± 0.000528 |

(Figures: KLD and perplexity graphs.)

Author: AesSedai

Likes: 6

Downloads: 40

Tags: gguf, base_model:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, base_model:quantized:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, endpoints_compatible, region:us, imatrix, conversational

Intel/Step-3.5-Flash-int4-mixed-AutoRound


base_model:
  - stepfun-ai/Step-3.5-Flash
pipeline_tag: text-generation

Model Details

This model is a mixed int4 quantization of stepfun-ai/Step-3.5-Flash with group_size 128 and symmetric quantization, generated by intel/auto-round via RTN (no algorithm tuning). Please follow the license of the original model.

How To Use

INT4 Inference

Start a vLLM server:

vllm serve Intel/Step-3.5-Flash-int4-mixed-AutoRound \
  --served-model-name step3p5-flash-int4-mixed \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
  --trust-remote-code
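
Once the server is running, it can be queried through the OpenAI-compatible API; a minimal sketch using the served model name from the command above (assuming vLLM's default port 8000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="step3p5-flash-int4-mixed",
    messages=[{"role": "user", "content": "Summarize the trade-offs of int4 weight-only quantization."}],
    temperature=0.6,
)
print(response.choices[0].message.content)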

Generate the Model

hf download stepfun-ai/Step-3.5-Flash --local-dir Step-3.5-Flash
auto_round ./Step-3.5-Flash --scheme W4A16 --iters 0 --disable_opt_rtn  --ignore_layers eh_proj,shared_head,layers.45 --layer_config "{mlp:{bits:8,data_type:int},self_attn:{bits:8,data_type:int},layers.46:{bits:8,data_type:int},layers.47:{bits:8,data_type:int}}"

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs. Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}


Author: Intel

Likes: 4

Downloads: 0

Tags: safetensors, step3p5, text-generation, conversational, custom_code, arxiv:2309.05516, base_model:stepfun-ai/Step-3.5-Flash, base_model:quantized:stepfun-ai/Step-3.5-Flash, 4-bit, auto-round, region:us

g-group-ai-lab/gipformer-65M-rnnt

Author: g-group-ai-lab

Likes: 3

Downloads: 0

Tags: onnx, region:us

DLLM-Agent/openPangu-7B-Diffusion-DeepDiver


pipeline_tag: text-generation

Open-Source Pangu openPangu-7B-Diffusion-DeepDiver


1. Introduction

openPangu-7B-Diffusion-DeepDiver is a 7B-parameter language model based on block diffusion large language models (Diffusion LLM), specifically trained and fine-tuned for multi-agent scenarios (including tool invocation, information retrieval, and multi-step decision-making). Its underlying architecture and inference pipeline follow the design of openPangu-R-7B-Diffusion (including block-wise denoising and bidirectional attention within blocks), thus maintaining consistent structures and interfaces for single-pass generation and parallel decoding.

For complete evaluations and training details, please refer to the technical report “DLLM Agent: See Farther, Run Faster” (arXiv:2602.07451v2).

  • openPangu-7B-Diffusion-DeepDiver: Agent model with a context length of 32k.

Key Features

  • Uses the same DLLM architecture and iterative inference pipeline as openPangu-R-7B-Diffusion.
  • Trained and fine-tuned with data and objectives tailored for in-depth agent research, making it more robust in multi-round tool invocation and planning tasks.
  • Introduces context-clean corruption and span-aware attention alignment to reduce noise propagation in multi-round agent dialogues and improve the reliability of tool invocation formats.

Inference

(Diagram: context-causal block diffusion LLM.)

openPangu-7B-Diffusion-DeepDiver adopts context-causal block diffusion decoding, performing diffusion decoding block by block. Within each block, full attention is applied, while causal attention is used for preceding context. Once all tokens in a block are decoded, the entire block is stored in the historical KV cache with causal masking, and decoding proceeds to the first token of the next block.

  • Supports variable-length inference and KV caching.
  • Flexible context length, not limited by block size.
  • Supports both autoregressive and block diffusion decoding.
  • Uses confidence-threshold sampling, achieving up to 2.5× throughput improvement over standard autoregressive decoding.
  • Similar to Fast dLLMv2, small blocks can be configured within each block to balance throughput and performance, with optimal results typically at small block sizes of 4 or 8.
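
An illustrative toy of the decoding loop described above; the model class, its methods, and all numbers are placeholders for exposition, not the actual openPangu inference code.

import numpy as np

MASK, EOS, VOCAB = -1, 2, 16   # toy sentinels and vocabulary size

class ToyDiffusionLM:
    # Stand-in model returning random per-position distributions; a real model conditions
    # on the cached context (causal) and on the current block (bidirectional).
    def prefill(self, prompt_ids):            return list(prompt_ids)
    def append_to_cache(self, cache, block):  return cache + list(block)
    def denoise_step(self, block, cache):
        rng = np.random.default_rng(len(cache) + block.count(MASK))
        p = rng.random((len(block), VOCAB))
        return p / p.sum(axis=1, keepdims=True)

def block_diffusion_decode(model, prompt_ids, block_size=8, threshold=0.2, max_blocks=4):
    cache, generated = model.prefill(prompt_ids), []
    for _ in range(max_blocks):
        block = [MASK] * block_size                       # each block starts fully masked
        while MASK in block:
            probs = model.denoise_step(block, cache)      # full attention within the block
            best = probs.argmax(axis=1)
            accepted = False
            for i in range(block_size):
                if block[i] == MASK and probs[i, best[i]] >= threshold:
                    block[i], accepted = int(best[i]), True   # confidence-threshold sampling
            if not accepted:                              # otherwise commit the single most confident slot
                i = max((k for k in range(block_size) if block[k] == MASK),
                        key=lambda k: probs[k, best[k]])
                block[i] = int(best[i])
        cache = model.append_to_cache(cache, block)       # finished block joins the causal KV cache
        generated.extend(block)
        if EOS in block:
            break
    return generated

print(block_diffusion_decode(ToyDiffusionLM(), prompt_ids=[5, 9, 3]))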

Inference in Agent Workflows

Integrated into the DeepDiver v2 Agent workflow, the model applies DLLM’s iterative denoising inference strategy for each round of tool generation.

Deepdiver v2 is a planner-centered MAS (Multi-Agent System) architecture that coordinates multiple executors.

For detailed information, refer to its technical report.

Training


Training Corpus

The model is trained on 11k specially collected or synthesized agent trajectories (including planner → seeker multi-agent interactions, real tool calls, and tool-return traces). These data aim to help the model learn to generate semantically consistent and format-compliant tool invocation instructions in multi-round interactions. See the "Agent-oriented Fine-tuning" section in the technical report for details.

Supervision Method

Cross-entropy losses for both diffusion and autoregressive models are jointly optimized during training, ensuring stability and preserving reliable left-to-right generation.

Masking and Attention Alignment

To address information contamination caused by diffusion when multi-round contexts and tool outputs are mixed, training applies context-clean corruption to mask irrelevant context segments and span-aware attention alignment within generation ranges. Experiments on agent datasets show that both techniques improve final information retrieval scores.

2. Model Architecture

| | openPangu-7B-Diffusion-DeepDiver |
| :--- | :---: |
| Architecture | Dense |
| Parameters (Non-Embedding) | 7B |
| Number of Layers | 34 |
| Hidden Dimension | 12800 |
| Attention Mechanism | GQA |
| Number of Attention Heads | 32 for Q, 8 for KV |
| Vocabulary Size | 153k |
| Context Length | 32k |
| Continued Training Tokens | 700B |

3. Evaluation Results

Table 1. Comparison results on a 110-question subset of BrowseComp-zh.

| Method | Accuracy (%) | Tool Calls | Agent Rounds | Tool Failure Rate |
| --- | ---: | ---: | ---: | ---: |
| AR Agent (autoregressive backbone) | 15.5 | 7.5 | 14.8 | 1.9% |
| DLLM Agent (diffusion backbone) | 15.5 | 6.7 | 13.0 | 6.4% |

Although the final accuracy is comparable to AR on this subset, DLLM requires fewer tool calls and sub-agent rounds, and achieves about 30% average end-to-end latency reduction. However, DLLM shows a higher tool failure rate, indicating that it is still less stable than AR models.

4. Deployment and Usage

4.1 Environment Setup

Hardware Requirements

Atlas 800T A2 (64GB). For drivers and firmware, see: [Atlas 800T A2].

Software Environment

  • OS: Linux (openEuler ≥ 24.03 recommended)
  • CANN == 8.1.RC1. See [CANN Install]
  • python == 3.10
  • torch == 2.6.0
  • torch-npu == 2.6.0
  • transformers == 4.53.2

The above configurations have been verified. Higher versions may be supported. Please submit an issue if you have questions.

4.2 Inference Examples

Below is a simple example of using openPangu-7B-Diffusion-DeepDiver with the transformers framework and the Deepdiver v2 Agent framework.

Loading and Running

Before running, modify generate.py to specify the model path.

cd inference
python generate.py

For optimal throughput, set sampling parameters to alg="confidence_threshold", threshold=0.9, num_small_blocks=1, and choose an appropriate batch size based on hardware.

Service Deployment

Download the lightweight service script, place it in the model directory, and start the service:

python launch_server.py --load /path/to/model --port 9999

Deepdiver v2

Download the Deepdiver v2 package (no model weights required) and install it following official documentation.

Copy env.template to config/.env, set MODEL_REQUEST_URL to the model service URL, and modify MODEL_NAME to match the deployed model name (default: local-diffusion-llm).

Start the MCP service:

python src/tools/mcp_server_standard.py

Send a query to Deepdiver v2:

python cli/demo.py -q "今天北京的天气怎么样?"  # "What's the weather like in Beijing today?"

For more usage details, refer to the official repository.

Currently, openPangu-7B-Diffusion-DeepDiver has only been trained and tested within the Deepdiver v2 framework. It has not been adapted for other agent frameworks or tasks, and performance on other setups is not guaranteed.

5. License

When using the model or its outputs, please cite the technical report: “DLLM Agent: See Farther, Run Faster” (arXiv:2602.07451v2).

Unless otherwise specified, openPangu-7B-Diffusion-DeepDiver is licensed under the OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which aims to promote the development of AI technologies. See the LICENSE file in the repository root for details.

6. Disclaimer

Due to inherent technical limitations and the nature of AI-generated content, Huawei makes no guarantees regarding the following:

  • The generated outputs may contain defects, inaccuracies, or inappropriate content, and do not represent Huawei’s views.
  • The model is not guaranteed to be 100% accurate, reliable, complete, timely, secure, error-free, uninterrupted, or stable.
  • The outputs do not constitute advice or decisions and do not guarantee authenticity, completeness, accuracy, legality, or usefulness. They cannot replace professional advice in medical, legal, or other domains. Users must make independent judgments, and Huawei assumes no responsibility.

7. Feedback

For suggestions or feedback, please submit an issue or contact: openPangu@huawei.com.

8. Citation

@article{zhen2026dllm,
  title={DLLM Agent: See Farther, Run Faster},
  author={Zhen, Huiling and Lin, Weizhe and Liu, Renxi and Han, Kai and Li, Yiming and Tian, Yuchuan and Chen, Hanting and Li, Xiaoguang and Li, Xiaosong and Chen, Chen and others},
  journal={arXiv preprint arXiv:2602.07451},
  year={2026}
}

Author: DLLM-Agent

Likes: 3

Downloads: 0

Tags: safetensors, PanguEmbedded, text-generation, conversational, custom_code, arxiv:2602.07451, region:us

inferencerlabs/NVIDIA-Nemotron-3-Super-120B-A12B-MLX-9bit


language: en
library_name: mlx
tags:
  - quantized
  - mlx
base_model:
  - nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
pipeline_tag: text-generation

See NVIDIA-Nemotron-3-Super-120B-A12B MLX in action - demonstration video

Tested on an M3 Ultra with 512 GB RAM using the Inferencer app v1.10.6:

  • Single inference: ~31.2 tokens/s @ 1000 tokens (in debug mode; release figures coming soon)
  • Batched inference ~ total tokens/s across five inferences
  • Memory usage: ~130 GiB
q9bit quant typically achieves near lossless accuracy in our coding test:

| Quantization | Perplexity | Token Accuracy | Missed Divergence |
|:---|:---|:---|:---|
| q3.5 | 168.0 | 43.45% | 72.57% |
| q4.5 | 1.33593 | 91.65% | 27.61% |
| q4.8 | 1.28125 | 93.75% | 21.15% |
| q5.5 | 1.23437 | 95.05% | 17.28% |
| q6.5 | 1.21875 | 96.95% | 12.03% |
| q8.5 | 1.21093 | 97.55% | 10.50% |
| q9 | 1.21093 | 97.55% | 10.50% |
| Base | 1.20312 | 100.0% | 0.000% |

Quantized with a modified version of MLX. For more details, see the demonstration video or visit NVIDIA-Nemotron-3-Super-120B-A12B-BF16.

Disclaimer

We are not the creator, originator, or owner of any model listed. Each model is created and provided by third parties. Models may not always be accurate or contextually appropriate. You are responsible for verifying the information before making important decisions. We are not liable for any damages, losses, or issues arising from its use, including data loss or inaccuracies in AI-generated content.

Author: inferencerlabs

Likes: 3

Downloads: 0

Tags: mlx, safetensors, nemotron_h, quantized, text-generation, conversational, custom_code, en, base_model:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, base_model:quantized:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, 8-bit, region:us

rudyon/rudygpt


language:
  - en
license: mit
tags:
  - text-generation
  - causal-lm
  - pretrained
datasets:
  - HuggingFaceFW/fineweb-edu
metrics:
  - perplexity
library_name: transformers
pipeline_tag: text-generation

rudygpt

A 124M parameter causal language model trained from scratch using rudyon/pipeline on the HuggingFaceFW/fineweb-edu dataset.

Training was done on a vast.ai instance with 2x 4090S Ti. Training cost was roughly $10.

usage

import torch
from transformers import AutoTokenizer
# download model.py and pytorch_model.bin manually or via hf_hub_download
from model import GPT, GPTConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("rudyon/rudygpt")
model = GPT(GPTConfig(depth=12, vocab_size=50304))
state_dict = torch.load("pytorch_model.bin", map_location='cpu')
model.load_state_dict(state_dict)
model.eval()
print(model.generate("Hello!"))

Author: rudyon

Likes: 2

Downloads: 0

Tags: transformers, pytorch, rudygpt, text-generation, causal-lm, pretrained, custom_code, en, dataset:HuggingFaceFW/fineweb-edu, license:mit, endpoints_compatible, region:us

internlm/Visual-ERM


license: apache-2.0

Author: internlm

Likes: 2

Downloads: 0

Tags: safetensors, qwen3_vl, license:apache-2.0, region:us

CYX1998/Meissa-4B


license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
language:
  - en
tags:
  - medical
  - agent
  - tool-calling
  - multi-agent
  - multimodal
  - qwen3-vl

Meissa-4B: Multi-modal Medical Agentic Intelligence

arXiv: https://arxiv.org/abs/2603.09018 | Dataset: https://huggingface.co/datasets/CYX1998/Meissa-SFT | GitHub: https://github.com/Schuture/Meissa

Meissa-4B is a lightweight 4B-parameter medical multi-modal LLM with full agentic capability. Instead of relying on proprietary frontier models (GPT, Gemini), Meissa brings tool calling, multi-agent collaboration, and clinical simulation offline by distilling structured trajectories from frontier agent systems into a compact vision-language model.

Key Features

  • 4 agentic paradigms in a single model: continuous tool calling, interleaved thinking with images, multi-agent collaboration, and multi-turn clinical simulation
  • Offline deployment: runs entirely locally with vLLM, no API calls needed
  • Tool calling: native <tool_call> support via Hermes format, compatible with vLLM's tool-call parser
  • Thinking: built-in <think> chain-of-thought reasoning before actions

Model Details

| | |
|---|---|
| Base model | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3VLForConditionalGeneration |
| Parameters | 4B |
| Precision | bfloat16 |
| Training method | LoRA SFT (rank=32, alpha=64), merged |
| Training data | 43,210 medical agentic trajectories (open subset) |
| Training framework | LLaMA-Factory |
| Context length | 8,192 tokens (training) |

Quickstart

Load with Transformers

from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "CYX1998/Meissa-4B",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("CYX1998/Meissa-4B", trust_remote_code=True)

Serve with vLLM (Recommended)

For agentic use cases, serve Meissa with vLLM to enable tool calling:

python -m vllm.entrypoints.openai.api_server \
    --model CYX1998/Meissa-4B \
    --port 8877 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --dtype bfloat16 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

# Set the endpoint
export OPENAI_BASE_URL="http://127.0.0.1:8877/v1"
export OPENAI_API_KEY="dummy"

The --enable-auto-tool-choice --tool-call-parser hermes flags are required for tool calling.

Example: Tool Calling

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8877/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "ChestXRayClassifier",
        "description": "Classify pathologies in a chest X-ray image.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_path": {"type": "string", "description": "Path to the chest X-ray image"}
            },
            "required": ["image_path"]
        }
    }
}]

response = client.chat.completions.create(
    model="CYX1998/Meissa-4B",
    messages=[{"role": "user", "content": "Analyze this chest X-ray: /path/to/cxr.jpg"}],
    tools=tools,
)
print(response.choices[0].message)

Supported Agentic Frameworks

| Framework | Description | Tools |
|-----------|-------------|-------|
| I: Continuous Tool Calling | Sequential tool use for radiology analysis | 8 chest X-ray tools (classifier, report generator, VQA, segmentation, etc.) |
| II: Interleaved Thinking with Images | Iterative visual reasoning with zoom | ZoomInSubfigure, SegmentRegion, Terminate |
| III: Multi-Agent Collaboration | Multi-agent medical consultation | AssessDifficulty, RecruitExperts, ConsultExperts, FacilitateDebate |
| IV: Clinical Simulation | Multi-turn doctor-patient interaction | RequestPhysicalExam, RequestTest, Terminate |

Training Data

Trained on 43,210 medical agentic SFT trajectories distilled from Gemini:

| Framework | Samples | Source Datasets |
|-----------|---------|-----------------|
| I: Continuous Tool Calling | 4,898 | MIMIC-CXR-VQA |
| II: Interleaved Thinking | 15,211 | PathVQA, MIMIC-CXR-VQA, SLAKE, VQA-RAD |
| III: Multi-Agent Collaboration | 15,427 | MIMIC-CXR-VQA, PathVQA, MedQA, PubMedQA |
| IV: Clinical Simulation | 7,674 | MedQA, MIMIC-CXR |

The open-source subset (25,018 samples) is available at CYX1998/Meissa-SFT.
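
A minimal sketch for pulling that subset, assuming it loads with the standard datasets library (split names and fields may differ on the Hub):

from datasets import load_dataset

ds = load_dataset("CYX1998/Meissa-SFT")   # open-source subset of the SFT trajectories
print(ds)                                 # inspect the available splits and fields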

Evaluation

Meissa-4B matches or exceeds GPT-4o and Gemini-3-flash on multiple medical agentic benchmarks while being deployable offline on a single GPU. See our paper for full results.

Limitations

  • Not for clinical use: This model is a research prototype and should NOT be used for real clinical decision-making.
  • English only: Trained and evaluated on English medical data only.
  • Domain scope: Primarily trained on radiology, pathology, and general clinical reasoning. Performance on other medical specialties may vary.
  • Hallucination: Like all LLMs, Meissa may generate plausible but incorrect medical information.

Citation

@article{chen2026meissa,
  title={Meissa: Multi-modal Medical Agentic Intelligence},
  author={Chen, Yixiong and Bai, Xinyi and Pan, Yue and Zhou, Zongwei and Yuille, Alan},
  journal={arXiv preprint arXiv:2603.09018},
  year={2026}
}

License

This model is released under Apache 2.0. The base model Qwen3-VL-4B-Instruct is subject to the Qwen License.

Author: CYX1998

Likes: 2

Downloads: 0

Tags: transformers, pytorch, safetensors, qwen3_vl, image-text-to-text, medical, agent, tool-calling, multi-agent, multimodal, qwen3-vl, conversational, en, arxiv:2603.09018, base_model:Qwen/Qwen3-VL-4B-Instruct, base_model:finetune:Qwen/Qwen3-VL-4B-Instruct, license:apache-2.0, endpoints_compatible, region:us