Today's AI Summary

AI Developments: 3D Reconstruction, LLM Reasoning, and Function Calling Models Emerge

Here's a look at the latest developments in AI, covering new models and research papers.

Research Highlights

Several interesting research papers have surfaced, focusing on improving the reasoning and capabilities of AI models.

  • Spatia: Video Generation with Updatable Spatial Memory introduces a framework for video generation that uses a 3D scene point cloud as spatial memory, enhancing spatial consistency and enabling camera control and 3D-aware editing.
  • Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants proposes training interpretability assistants to predict model behavior from activations, improving the detection of jailbreaks and latent user attributes.
  • mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs presents a Video-Action Model (VAM) that combines a pretrained video model with an action decoder, improving sample efficiency and convergence speed in robotic manipulation tasks.
  • BashArena: A Control Setting for Highly Privileged AI Agents introduces a setting for studying AI control techniques in security-critical environments, evaluating LLMs on their ability to complete tasks and perform sabotage undetected.
  • Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning introduces G2RL, a gradient guided reinforcement learning framework in which exploration is driven by the model's own first order update geometry, improving reasoning abilities.
  • Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers evaluates LatentQA-trained models as Activation Oracles (AOs), finding that they can recover information fine-tuned into a model and match or exceed prior white-box baselines on downstream tasks.
  • Explaining the Reasoning of Large Language Models Using Attribution Graphs introduces the Context Attribution via Graph Explanations (CAGE) framework, which improves context attribution faithfulness by quantifying how each generation is influenced by the prompt and prior generations.
  • Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning proposes Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model, demonstrating strong critic-thinking capabilities and producing more interpretable reasoning traces.
  • PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning introduces PPSEBM, a framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks.
  • Artism: AI-Driven Dual-Engine System for Art Generation and Critique proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art.

Model Releases

Several new models have been released, showcasing advancements in different areas of AI.

  • unsloth/functiongemma-270m-it-GGUF (13 likes): A GGUF format of Google's FunctionGemma model, designed for function calling tasks. It is intended to be fine-tuned for specific function-calling tasks and is optimized for deployment in resource-limited environments.
  • facebook/map-anything-v1 (11 likes): A transformer model for 3D reconstruction, supporting tasks like multi-view stereo and depth estimation. It directly regresses the factored metric 3D geometry of a scene given various types of modalities as inputs.
  • cybermotaz/nemotron3-nano-nvfp4-w4a16 (2 likes, 153 downloads): An optimized, quantized version of NVIDIA's Nemotron 3 Nano 30B, designed for maximum inference performance on NVIDIA Blackwell GPUs. It achieves significant memory reduction and speedup.
  • AIDC-AI/Omni-View (4 likes): A model extending multi-modal understanding and generation to 3D scenes based on multiview images, achieving state-of-the-art performance on the VSI-Bench benchmark.
  • mrfakename/sam-audio-large (2 likes): A model for isolating sounds in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.

Key Takeaways

  • 3D and Video Advancements: Significant progress is being made in 3D reconstruction and video generation, with models like MapAnything and Spatia offering new capabilities.
  • LLM Reasoning and Interpretability: Research continues to focus on improving the reasoning abilities and interpretability

AI Papers for 2026-02-05

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
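
A minimal sketch of the parameterization described above, assuming a PyTorch implementation; the construction of the frozen factors B and Q shown here (top singular directions of the pretrained weight) is an illustrative stand-in for the paper's redundancy-based construction:

import torch
import torch.nn as nn

class PlateStyleLinear(nn.Module):
    """Sketch of a PLATE-style layer: effective weight = W + B A Q^T, with B and Q frozen."""

    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # pretrained weights stay frozen

        # Illustrative construction of the frozen factors from the pretrained weight
        # (the paper derives them from geometric redundancy; SVD is a stand-in here).
        W = linear.weight.data                      # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("B", U[:, :rank].clone())       # (out, r), frozen
        self.register_buffer("Q", Vh[:rank, :].T.clone())     # (in, r), frozen
        self.A = nn.Parameter(torch.zeros(rank, rank))        # the only trainable factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.B @ self.A @ self.Q.T                  # Delta W = B A Q^T
        return self.linear(x) + x @ delta_w.T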

PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization

In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before full-scale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack the spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical considerations in AI-assisted filmmaking.

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face space are released at https://github.com/ResearAI/AutoFigure.

Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family contains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available at https://anonymous.4open.science/r/birdnoise-85CD/README.md
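
A minimal sketch of the general log-linear fusion-with-gating idea, assuming PyTorch; the gate inputs and network shape are illustrative, not FINCH's actual implementation:

import torch
import torch.nn as nn

class GatedLogLinearFusion(nn.Module):
    """Fuse audio logits with contextual logits through a learned per-sample gate.

    With the gate at 0 the prediction reduces to the audio-only classifier, and
    max_gate explicitly bounds how far contextual evidence can move the output."""

    def __init__(self, num_stats: int, max_gate: float = 1.0):
        super().__init__()
        self.gate_net = nn.Sequential(nn.Linear(num_stats, 16), nn.ReLU(), nn.Linear(16, 1))
        self.max_gate = max_gate

    def forward(self, audio_logits, context_logits, reliability_stats):
        # reliability_stats: per-sample uncertainty/informativeness features, shape (batch, num_stats)
        gate = self.max_gate * torch.sigmoid(self.gate_net(reliability_stats))  # (batch, 1)
        return audio_logits + gate * context_logits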

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning -- spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.
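
A minimal sketch of the two-threshold stopping rule described above; the confidence signal and the parametric form of the lower threshold are illustrative, and in the paper both thresholds are calibrated on a validation set via distribution-free risk control:

def should_stop(confidence: float, tokens_used: int, budget: int,
                upper: float, lower_base: float, lower_slope: float) -> str:
    """Decide what to do after one reasoning step: 'answer', 'give_up', or 'continue'.

    upper: stop and answer once the model looks confident enough (risk: wrong answer).
    lower: a parametric threshold that rises as tokens are spent, preemptively stopping
           instances that look unsolvable (risk: premature stoppage).
    """
    lower = lower_base + lower_slope * (tokens_used / budget)
    if confidence >= upper:
        return "answer"      # confident: emit the current answer and save the remaining budget
    if confidence <= lower or tokens_used >= budget:
        return "give_up"     # unlikely to be solved within the budget
    return "continue"        # keep reasoning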

Antidistillation Fingerprinting

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.

Enhancing Imbalanced Node Classification via Curriculum-Guided Feature Learning and Three-Stage Attention Network

Imbalanced node classification in graph neural networks (GNNs) happens when some labels are much more common than others, which causes the model to learn unfairly and perform badly on the less common classes. To solve this problem, we propose a Curriculum-Guided Feature Learning and Three-Stage Attention Network (CL3AN-GNN), a learning network that uses a three-step attention system (Engage, Enact, Embed) similar to how humans learn. The model begins by engaging with structurally simpler features, defined as (1) local neighbourhood patterns (1-hop), (2) low-degree node attributes, and (3) class-separable node pairs identified via initial graph convolutional networks and graph attention networks (GCN and GAT) embeddings. This foundation enables stable early learning despite label skew. The Enact stage then addresses complicated aspects: (1) connections that require multiple steps, (2) edges that connect different types of nodes, and (3) nodes at the edges of minority classes by using adjustable attention weights. Finally, Embed consolidates these features via iterative message passing and curriculum-aligned loss weighting. We evaluate CL3AN-GNN on eight Open Graph Benchmark datasets spanning social, biological, and citation networks. Experiments show consistent improvements across all datasets in accuracy, F1-score, and AUC over recent state-of-the-art methods. The model's step-by-step method works well with different types of graph datasets, showing quicker results than training everything at once, better performance on new, imbalanced graphs, and clear explanations of each step using gradient stability and attention correlation learning curves. This work provides both a theoretically grounded framework for curriculum learning in GNNs and practical evidence of its effectiveness against imbalances, validated through metrics, convergence speeds, and generalisation tests.

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.
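
A minimal sketch of how offline trajectories might be split into partial-trajectory prompts for single-step training; the record format and prompt layout are illustrative, not Cobalt's actual data format:

def make_bandit_examples(trajectory):
    """Split one offline multi-turn trajectory into single-step training examples.

    trajectory: list of dicts like {"feedback": str, "code": str}, one per turn.
    Every prefix becomes a contextual prompt; the next turn's code is the single-step
    completion the policy is trained to produce during online bandit learning.
    """
    examples = []
    for t in range(len(trajectory)):
        context = trajectory[:t]
        prompt = "\n\n".join(
            f"Feedback:\n{turn['feedback']}\n\nCode:\n{turn['code']}" for turn in context
        )
        examples.append({"prompt": prompt, "target": trajectory[t]["code"]})
    return examples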

Do We Need Asynchronous SGD? On the Near-Optimality of Synchronous Methods

Modern distributed optimization methods mostly rely on traditional synchronous approaches, despite substantial recent progress in asynchronous optimization. We revisit Synchronous SGD and its robust variant, called $m$-Synchronous SGD, and theoretically show that they are nearly optimal in many heterogeneous computation scenarios, which is somewhat unexpected. We analyze the synchronous methods under random computation times and adversarial partial participation of workers, and prove that their time complexities are optimal in many practical regimes, up to logarithmic factors. While synchronous methods are not universal solutions and there exist tasks where asynchronous methods may be necessary, we show that they are sufficient for many modern heterogeneous computation scenarios.

AI Models

FutureMa/Eva-4B-V2


  β€’ license: apache-2.0
  β€’ language: en
  β€’ library_name: transformers
  β€’ pipeline_tag: text-classification
  β€’ tags: finance, earnings-calls, evasion-detection, nlp, qwen3
  β€’ base_model: Qwen/Qwen3-4B-Instruct-2507
  β€’ datasets: FutureMa/EvasionBench

Eva-4B-V2

Model: https://huggingface.co/FutureMa/Eva-4B-V2 | Dataset: https://huggingface.co/datasets/FutureMa/EvasionBench | GitHub: https://github.com/IIIIQIIII/EvasionBench | Project Page: https://iiiiqiiii.github.io/EvasionBench

A 4B parameter model fine-tuned for detecting evasive answers in earnings call Q&A sessions.

Model Description

Performance

Eva-4B-V2 achieves 84.9% Macro-F1 on the EvasionBench evaluation set, outperforming frontier LLMs:


| Rank | Model | Macro-F1 |
|------|-------|----------|
| 1 | Eva-4B-V2 | 84.9% |
| 2 | Gemini 3 Flash | 84.6% |
| 3 | Claude Opus 4.5 | 84.4% |
| 4 | GLM-4.7 | 82.9% |
| 5 | GPT-5.2 | 80.9% |

Per-Class Performance

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| Direct | 90.6% | 75.1% | 82.1% |
| Intermediate | 73.7% | 87.7% | 80.1% |
| Fully Evasive | 93.3% | 91.6% | 92.4% |

Label Definitions

| Label | Definition |
|-------|------------|
| direct | The core question is directly and explicitly answered |
| intermediate | The response provides related context but sidesteps the specific core |
| fully_evasive | The question is ignored, explicitly refused, or entirely off-topic |

Training

Two-Stage Training Pipeline

Qwen3-4B-Instruct-2507
        β”‚
        β–Ό Stage 1: 60K consensus data
        β”‚
Eva-4B-Consensus
        β”‚
        β–Ό Stage 2: 24K three-judge data
        β”‚
Eva-4B-V2

Training Configuration

| Parameter | Stage 1 | Stage 2 |
|-----------|---------|---------|
| Dataset | 60K consensus | 24K three-judge |
| Epochs | 2 | 2 |
| Learning Rate | 2e-5 | 2e-5 |
| Batch Size | 32 | 32 |
| Max Length | 2500 | 2048 |
| Precision | bfloat16 | bfloat16 |

Hardware

  • Stage 1: 2x NVIDIA B200 (180GB SXM6)
  • Stage 2: 4x NVIDIA H100 (80GB SXM5)

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Prompt template
prompt = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A

Question: What is the expected margin for Q4?
Answer: We expect it to be 32%.

Response format:
```json
{"label": "direct|intermediate|fully_evasive"}
```

Answer in ```json content, no other text"""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Greedy decoding; temperature has no effect when do_sample=False, so it is omitted.
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
# Output: ```json
# {"label": "direct"}
# ```
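
Since the model answers inside a ```json fence, a small helper (not part of the official card) can pull out the predicted label from the decoded text:

import json
import re

def extract_label(decoded: str) -> str:
    """Pull the {"label": ...} object out of the model's fenced JSON answer."""
    match = re.search(r"\{.*?\}", decoded, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {decoded!r}")
    return json.loads(match.group(0))["label"]

# Reusing `generated` and `tokenizer` from the example above:
print(extract_label(tokenizer.decode(generated, skip_special_tokens=True)))
# direct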

With vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="FutureMa/Eva-4B-V2")
sampling_params = SamplingParams(temperature=0, max_tokens=64)

outputs = llm.generate([prompt], sampling_params)  # `prompt` is the same prompt string built in the Transformers example above
print(outputs[0].outputs[0].text)

Links

| Resource | URL |
|----------|-----|
| Dataset | FutureMa/EvasionBench |
| GitHub | IIIIQIIII/EvasionBench |

Citation

@misc{eva4b2025,
  title={Eva-4B: A Fine-tuned Model for Evasion Detection in Earnings Calls},
  author={EvasionBench Team},
  year={2025},
  url={https://github.com/IIIIQIIII/EvasionBench}
}

License

Apache 2.0

Author: FutureMa

Likes: 26

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, finance, earnings-calls, evasion-detection, nlp, text-classification, en, dataset:FutureMa/EvasionBench, base_model:Qwen/Qwen3-4B-Instruct-2507, base_model:finetune:Qwen/Qwen3-4B-Instruct-2507, license:apache-2.0, text-embeddings-inference, endpoints_compatible, region:us

lovedheart/Qwen3-Coder-Next-REAP-48B-A3B-GGUF


  β€’ base_model: Qwen/Qwen3-Coder-Next
  β€’ tags: text-generation-inference
  β€’ license: apache-2.0

Qwen3-coder-next-reap

Qwen3-Coder-Next-REAP-48B-A3B has the following specifications:

  • Type: Causal Language Models
  • Number of Parameters: 48B in total and 3B activated
  • Hidden Dimension: 2048
  • Number of Layers: 48
  • Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
  • Gated Attention:
  • Number of Attention Heads: 16 for Q and 2 for KV
  • Head Dimension: 256
  • Rotary Position Embedding Dimension: 64
  • Gated DeltaNet:
    **Number of Linear Attention Heads: 32 for V and 16 for QK
    **Head Dimension: 128
  • Mixture of Experts:
  • **Number of Experts: 308 (uniformly pruned from 512)
  • **Number of Activated Experts: 10
  • **Number of Shared Experts: 1
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens
  • Compression Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression Ratio: 40% expert pruning

Author: lovedheart

Likes: 14

Downloads: 0

Tags: gguf, text-generation-inference, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us, conversational

GadflyII/Qwen3-Coder-Next-NVFP4


  β€’ library_name: transformers
  β€’ base_model: Qwen/Qwen3-Coder-Next
  β€’ tags: qwen3, moe, nvfp4, quantized, llmcompressor, vllm
  β€’ license: apache-2.0
  β€’ pipeline_tag: text-generation

Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (Pending PR into upstream).

https://github.com/Gadflyii/vllm/tree/main

Qwen3-Coder-Next-NVFP4

NVFP4 quantized version of Qwen/Qwen3-Coder-Next (80B-A3B).

Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE) |
| Parameters | 80B total, 3B activated per token |
| Experts | 512 total, 10 activated + 1 shared |
| Layers | 48 |
| Context Length | 262,144 tokens (256K) |
| Quantization | NVFP4 (FP4 weights + FP4 activations) |
| Size | 45GB (down from ~149GB BF16, 70% reduction) |
| Format | compressed-tensors |

Quantization Details

Quantized using llmcompressor 0.9.0.1.

NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048
DATASET = "HuggingFaceH4/ultrachat_200k" (train_sft)
moe_calibrate_all_experts = True

# Layers kept in BF16
ignore = [
    "lm_head",
    "re:.*mlp.gate$",               # MoE router gates
    "re:.*mlp.shared_expert_gate$", # Shared expert gates
    "re:.*linear_attn.*",           # DeltaNet linear attention
]

Benchmark Results

MMLU-Pro

| Model | Accuracy | Delta |
|-------|----------|-------|
| BF16 | 52.90% | - |
| NVFP4 | 51.27% | -1.63% |

Context Length Testing

Successfully tested up to 128K tokens with FP8 KV cache (not enough VRAM to test longer contexts).

Usage with vLLM

Requires vLLM with NVFP4 support (0.16.0+), Transformers 5.0.0+

# vLLM serving
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --kv-cache-dtype fp8
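
Once the server is running, any OpenAI-compatible client can query it; a minimal sketch assuming vLLM's default port 8000 (adjust the URL if you pass --port):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-Coder-Next-NVFP4",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=512,
)
print(response.choices[0].message.content)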

License

Apache 2.0 (same as base model)

Acknowledgments

Author: GadflyII

Likes: 4

Downloads: 125

Tags: transformers, safetensors, qwen3_next, text-generation, qwen3, moe, nvfp4, quantized, llmcompressor, vllm, conversational, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, compressed-tensors, region:us

BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4


  β€’ license: apache-2.0
  β€’ language: en
  β€’ library_name: transformers
  β€’ tags: text-generation, prompt-engineering, image-generation, qwen3, gguf
  β€’ pipeline_tag: text-generation

πŸš€ Z-Engineer V4 (4B)

The Z-Engineer returns, now with a PhD in "not being mid."

This is Z-Engineer V4, the culmination of extensive research into what makes an AI prompt engineer actually good at its job. Built on the Qwen 3 architecture and trained using a novel SMART Training methodology, this 4B parameter model doesn't just describe scenes; it understands the craft of visual storytelling down to the lens flare.


🧠 What is this?

Z-Engineer V4 is a fully fine-tuned (not LoRA, we went all in) version of the text encoder from Tongyi-MAI/Z-Image-Turbo. It's been specifically trained to understand the nuances of AI Image Generation workflows.

It excels at:

  • Expanding Concepts: Turn "sad robot in rain" into a cinematic fever dream with chromatic aberration, shallow depth of field, and a melancholic color grade that would make Blade Runner jealous.
  • Technical Precision: It knows the difference between an 85mm portrait lens and a 24mm wideβ€”and will use them appropriately. Lighting? Rembrandt, split, volumetric fog? It's got opinions.
  • Stylistic Consistency: It writes with a creative voice, not that robotic "hyperrealistic, 8k, trending on artstation" energy.

πŸ”‘ Key Use Cases

  β€’ ✨ Prompt Enhancement: A low-VRAM powerhouse for turning your braindead 3AM ideas into detailed visual narratives.
  β€’ πŸ”Œ Z-Image Turbo Encoder: Fully backwards compatible as a drop-in CLIP text encoder for Z-Image Turbo workflows producing varied and unique results from the same seed.
  β€’ πŸ›‘οΈ Local & Private: Runs entirely on your machine. No API fees, no data logging, no corporate overlords judging your prompts.
  β€’ ⚑ Hybrid Power: Use it to expand a prompt, then use the model itself as the encoder for generation. It's turtles all the way down.

🧬 What's New in V4: SMART Training

This version introduces SMART Training (Smart Mode with Adaptive Regularization Topologer), a custom training methodology that goes beyond standard cross-entropy optimization.

The secret sauce? Four auxiliary regularizers that operate on hidden states, logits, and weight matrices:

| Regularizer | What It Does | Why It Matters |
|-------------|--------------|----------------|
| Entropic | Prevents mode collapse, encourages diversity | No more repetitive "cinematic, 8k, masterpiece" loops |
| Holographic | Enforces depth-wise information compression | Clean feature hierarchy from surface to abstract |
| Topological | Encourages coherent latent trajectories | Prompts flow logically instead of word salad |
| Manifold | Stabilizes weight distributions | Rock-solid training dynamics |

The result? A model that generalizes better, outputs more varied responses, and doesn't collapse into repetitive patterns even after 55,000 training examples.
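
The SMART regularizers themselves are not published; as a rough illustration of the entropic term only, a standard entropy bonus can be added to the cross-entropy objective (a sketch, not the actual SMART implementation):

import torch
import torch.nn.functional as F

def loss_with_entropy_bonus(logits, targets, entropy_weight=0.003):
    """Cross-entropy plus an entropy bonus over the predictive distribution.

    Rewarding higher entropy pushes the model away from collapsing onto a handful
    of stock phrases, which is the failure mode the Entropic regularizer targets.
    """
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    return ce - entropy_weight * entropy  # higher entropy lowers the total loss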


πŸ“‰ Key Improvements Over V2.5

  β€’ Full Fine-Tune: V2.5 was a merged LoRA. V4 is a full parameter fine-tune; every single weight has been updated.
  β€’ Bigger Dataset: Trained on 55,000 examples (vs 34,678 for V2.5), 60% more data.
  β€’ SMART Regularization: Novel training methodology that actively prevents the failure modes that plagued earlier versions.
  β€’ Longer Training: 7,500+ optimizer steps with extensive validation checkpointing.
  β€’ Loss Reduction: 55% decrease in validation loss (2.80 β†’ 1.27) compared to baseline.

πŸ”Œ ComfyUI Integration (Recommended)

I have a custom node for seamless integration with ComfyUI:

  • Features: Optimized for local OpenAI API compatible backends (LM Studio, Ollama, etc.)
  • Get it here: ComfyUI-Z-Engineer

πŸ“ Recommended System Prompt

For best results, use this system prompt:

Interpret the user seed as production intent, then build a definitive 200-250 word single-paragraph image prompt that preserves every explicit constraint while intelligently expanding missing details. First infer the core subject, action, setting, and emotional tone; treat these as non-negotiable anchors. Then enhance with precise visual staging (explicit foreground, midground, background), clear visual hierarchy and eye path, physically plausible lighting (source, direction, softness, color temperature), and optical strategy (if lens/aperture are provided, preserve exactly; if absent, choose fitting lens and aperture and imply their depth-of-field effect). Integrate organic, manufactured, and environmental textures with realistic material behavior, add motion/atmospheric cues only when they support the scene, and apply a coherent color grade consistent with mood and environment. Keep the prose vivid but controlled: no contradictions, no overstuffing, no generic filler. Do not mention camera body brands. Output one polished paragraph only, no bullets, no line breaks, no meta commentary.
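
A minimal sketch of sending that system prompt to a local OpenAI-compatible backend such as LM Studio or Ollama; the endpoint URL and model name below are examples and depend on your local setup:

from openai import OpenAI

SYSTEM_PROMPT = "Interpret the user seed as production intent, ..."  # full prompt from above

# Example endpoint for LM Studio's local server; Ollama and other OpenAI-compatible
# backends work the same way with their own URL, key, and model name.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-4b-z-image-engineer-v4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "sad robot in rain"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)  # one expanded single-paragraph image prompt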


πŸ’» Training Facts

I believe in open science. Here's exactly how this was built:

Hardware:

  • Trained locally on an AMD Strix Halo system (Ryzen AI Max+ 395, 128GB Unified RAM)
  • AMD Radeon 8060S Graphics (ROCm/HIP)

Dataset:

  • Size: 55,000 high-quality examples
  • 25,000 Vision-Grounded Samples: Real professional photographs transcribed into the training format using Qwen3-VL-30B-A3Bβ€”teaching the model what actually good cinematography looks like
  • 30,000 Synthetic Samples: Generated prompt enhancement pairs for diverse concept coverage
  • Content: Curated mix teaching the model to extrapolate seed concepts into cinematic prompts grounded in real photographic technique

Training Configuration:

| Parameter | Value |
|-----------|-------|
| Method | Full Fine-Tune (not LoRA) |
| Base Model | Qwen3-4b-Z-Image-Turbo-AbliteratedV1 |
| Optimizer Steps | 7,500+ |
| Batch Size | 2 Γ— 8 accumulation = 16 effective |
| Learning Rate | 1e-5 (cosine decay with 5% warmup) |
| Precision | BFloat16 |
| Sequence Length | 640 tokens |
| Total Training Time | ~90 hours |


πŸ“¦ GGUF & Quantization

I provide a full suite of GGUF quantizations for use with llama.cpp, Ollama, and LM Studio:

| Quantization | Size | Notes |
|--------------|------|-------|
| F16 | 8.0 GB | Full precision, maximum quality |
| Q8_0 | 4.3 GB | Near-lossless, recommended for most users |
| Q6_K | 3.3 GB | Great balance of quality and size |
| Q5_K_M | 2.9 GB | Good quality, smaller footprint |
| Q5_K_S | 2.8 GB | Slightly smaller Q5 variant |
| Q4_K_M | 2.5 GB | Solid 4-bit, good for VRAM-limited setups |
| Q4_K_S | 2.4 GB | Smaller 4-bit variant |
| Q3_K_L | 2.2 GB | Lower quality 3-bit, for the desperate |
| Q3_K_M | 2.1 GB | Medium 3-bit |
| Q2_K | 1.7 GB | Emergency-only tier. But it exists! |


🎯 Quick Start

With Ollama:

ollama run BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4

With LM Studio:

  1. Download the GGUF of your choice
  2. Load it in LM Studio
  3. Use the ComfyUI node or chat directly

⚠️ Disclaimer

This model generates text for image prompts. While I have filtered the dataset to the best of my ability, users should exercise their own judgment. I am not responsible for the content you generate.

Also, if you use this to generate prompts for images that get you in trouble, that's a you problem. The model is just vibing.


πŸ™ Acknowledgements

  • Qwen Team for the excellent base architecture
  • Tongyi-MAI for Z-Image-Turbo
  • The open source AI community for making this kind of work possible
  • My electricity bill, which now classifies me as a small industrial facility

Built with ❀️ and way too much GPU time by BennyDaBall

Author: BennyDaBall

Likes: 3

Downloads: 0

Tags: transformers, safetensors, gguf, text-generation, prompt-engineering, image-generation, qwen3, en, license:apache-2.0, endpoints_compatible, region:us, conversational

lovedheart/Qwen3-Coder-Next-REAP-40B-A3B-GGUF


  β€’ base_model: Qwen/Qwen3-Coder-Next
  β€’ tags: text-generation-inference
  β€’ license: apache-2.0

Qwen3-coder-next-reap

Qwen3-Coder-Next-REAP-40B-A3B has the following specifications:

  • Type: Causal Language Models
  • Number of Parameters: 40B in total and 3B activated
  • Hidden Dimension: 2048
  • Number of Layers: 48
  • Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
  • Gated Attention:
  • Number of Attention Heads: 16 for Q and 2 for KV
  • Head Dimension: 256
  • Rotary Position Embedding Dimension: 64
  • Gated DeltaNet:
    **Number of Linear Attention Heads: 32 for V and 16 for QK
    **Head Dimension: 128
  • Mixture of Experts:
  • **Number of Experts: 256 (uniformly pruned from 512)
  • **Number of Activated Experts: 10
  • **Number of Shared Experts: 1
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens
  • Compression Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression Ratio: 50% expert pruning

Test video (writing a tiny video game at Q4_K_XL): https://bilibili.com/video/BV1k6fZB7EQz/

Author: lovedheart

Likes: 3

Downloads: 0

Tags: gguf, text-generation-inference, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us, conversational

mingyi456/Ace-Step1.5-DF11-ComfyUI


  β€’ license: mit
  β€’ language: en, zh
  β€’ pipeline_tag: text-to-image
  β€’ tags: comfyui, diffusion-single-file
  β€’ base_model: ACE-Step/Ace-Step1.5
  β€’ base_model_relation: quantized

Author: mingyi456

Likes: 3

Downloads: 0

Tags: diffusion-single-file, comfyui, text-to-image, en, zh, base_model:ACE-Step/Ace-Step1.5, base_model:quantized:ACE-Step/Ace-Step1.5, license:mit, region:us

BennyDaBall/LFM2.5-1.2B-Z-Image-Engineer-V4


  β€’ license: apache-2.0
  β€’ language: en
  β€’ library_name: transformers
  β€’ tags: text-generation, prompt-engineering, image-generation, lfm, liquid, gguf
  β€’ pipeline_tag: text-generation

πŸš€ LFM2.5-1.2B-Z-Image-Engineer-V4

The Z-Engineer goes liquid: smaller, faster, and ready to drink.

This is Z-Engineer V4 built on Liquid Foundation Model 2.5 (LFM2.5), a 1.2B parameter model that punches way above its weight class. Perfect for batch workflows where you need prompt engineering at warp speed.


🧠 What is this?

LFM2.5-1.2B-Z-Image-Engineer-V4 is a fully fine-tuned version of LiquidAI/LFM2.5-1.2B-Base. It's been specifically trained to understand the nuances of AI Image Generation workflows.

It excels at:

  • Expanding Concepts: Turn "neon samurai" into a full cinematic sequence with lighting, lens choices, and atmosphere.
  • Technical Precision: Understands camera terminology, lighting setups, and film aesthetics.
  • Blazing Speed: At 1.2B parameters, it's ~3x faster than the Qwen3-4B version while maintaining quality.

πŸ”‘ Key Use Cases

  β€’ ⚑ High-Throughput Workflows: When you need to expand hundreds or thousands of prompts, LFM2.5's speed shines.
  β€’ πŸ’Ύ Low VRAM Deployments: Runs comfortably on minimal hardware, perfect for embedded or edge use cases.
  β€’ πŸ›‘οΈ Local & Private: Runs entirely on your machine. No API fees, no data logging.
  β€’ πŸ”Œ ComfyUI Ready: Works with the same ComfyUI-Z-Engineer node as the Qwen3 version.

🧬 SMART Training: Adapted for LFM2.5's Hybrid Architecture

This version uses SMART Training (Smart Mode with Adaptive Regularization Topologer), the same methodology used for Qwen3-4B-Z-Engineer-V4, but adapted for LFM2.5's unique hybrid architecture.

LFM2.5's Challenge: Unlike traditional transformers, LFM2.5 uses a hybrid architecture mixing attention layers with recurrent (liquid) layers. The standard SMART regularizers needed significant adaptation:

| Adaptation | What Changed | Why |
|------------|--------------|-----|
| Attention-Only Filtering | Regularizers only process attention layer outputs, skipping recurrent layers | Recurrent layer hidden states have different statistical properties |
| Layer Pooling | Last 4 attention layers are mean-pooled for topology regularization | Provides stable representation despite sparser attention placement |
| Reduced Regularizer Weights | Entropic: 0.003, Holographic: 0.01, Topology: 0.02/0.02 | LFM2.5's smaller capacity needs gentler regularization |
| Superfluid-Inspired Damping | "SmartGate" auto-reduces aux loss contribution on gradient instability | Prevents training collapse when hybrid layers produce non-finite gradients |

The result? Stable training on a fundamentally different architecture while still benefiting from diversity, coherence, and depth regularization.
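
The SmartGate mechanism is only described at a high level; one plausible reading, sketched here as an assumption rather than the author's code, is to shrink the auxiliary-loss weight whenever a step produces non-finite gradients:

import torch

def damped_total_loss(main_loss, aux_losses, model, gate_state, decay=0.5, floor=0.01):
    """Combine the main loss with auxiliary regularizers, shrinking their weight
    whenever the previous step produced non-finite gradients."""
    grads_finite = all(
        p.grad is None or torch.isfinite(p.grad).all() for p in model.parameters()
    )
    if not grads_finite:
        gate_state["scale"] = max(floor, gate_state["scale"] * decay)  # damp the aux contribution
    return main_loss + gate_state["scale"] * sum(aux_losses)

# gate_state = {"scale": 1.0} is kept across training steps by the caller.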


πŸ“‰ Why Choose LFM2.5 Over Qwen3-4B?

| Aspect | LFM2.5-1.2B | Qwen3-4B |
|--------|-------------|----------|
| Parameters | 1.2B | 4B |
| Speed | ~3x faster | Baseline |
| VRAM | ~1-2 GB (Q4) | ~2.5 GB (Q4) |
| Quality | Good for most use cases | Highest quality |
| Best For | Batch processing, edge deployment, speed-critical workflows | Maximum quality, complex scenes |

Choose LFM2.5 when: You're processing large batches, running on limited hardware, or speed matters more than marginal quality gains.

Choose Qwen3-4B when: You want the absolute best quality and can afford the extra compute.


πŸ”Œ ComfyUI Integration

Works with the same custom node as the Qwen3 version:


πŸ“ Recommended System Prompt

For best results, use this system prompt:

Interpret the user seed as production intent, then build a definitive 200-250 word single-paragraph image prompt that preserves every explicit constraint while intelligently expanding missing details. First infer the core subject, action, setting, and emotional tone; treat these as non-negotiable anchors. Then enhance with precise visual staging (explicit foreground, midground, background), clear visual hierarchy and eye path, physically plausible lighting (source, direction, softness, color temperature), and optical strategy (if lens/aperture are provided, preserve exactly; if absent, choose fitting lens and aperture and imply their depth-of-field effect). Integrate organic, manufactured, and environmental textures with realistic material behavior, add motion/atmospheric cues only when they support the scene, and apply a coherent color grade consistent with mood and environment. Keep the prose vivid but controlled: no contradictions, no overstuffing, no generic filler. Do not mention camera body brands. Output one polished paragraph only, no bullets, no line breaks, no meta commentary.


πŸ’» Training Facts

I believe in open science. Here's exactly how this was built:

Hardware:

  • Trained locally on an AMD Strix Halo system (Ryzen AI Max+ 395, 128GB Unified RAM)
  • AMD Radeon 8060S Graphics (ROCm/HIP)

Dataset:

  • Size: 55,000 high-quality examples (same dataset as Qwen3-4B version)
  • 25,000 Vision-Grounded Samples: Real professional photographs transcribed using Qwen3-VL-30B-A3B
  • 30,000 Synthetic Samples: Generated prompt enhancement pairs

Training Configuration:

| Parameter | Value |
|-----------|-------|
| Method | Full Fine-Tune (not LoRA) |
| Base Model | LiquidAI/LFM2.5-1.2B-Base |
| Optimizer Steps | 3,500 |
| Batch Size | 8 Γ— 3 accumulation = 24 effective |
| Learning Rate | 5e-6 (cosine decay with 5% warmup) |
| Precision | BFloat16 |
| Sequence Length | 640 tokens |


πŸ“¦ GGUF & Quantization

I provide a full suite of GGUF quantizations for use with llama.cpp, Ollama, and LM Studio:

| Quantization | Size | Notes |
|--------------|------|-------|
| F16 | 2.2 GB | Full precision, maximum quality |
| Q8_0 | 1.2 GB | Near-lossless, recommended |
| Q6_K | 918 MB | Great balance |
| Q5_K_M | 804 MB | Good quality |
| Q5_K_S | 787 MB | Slightly smaller |
| Q4_K_M | 697 MB | Solid 4-bit |
| Q4_K_S | 668 MB | Smaller 4-bit |
| Q3_K_L | 606 MB | Lower quality |
| Q3_K_M | 573 MB | Medium 3-bit |


🎯 Quick Start

With LM Studio:

  1. Download the GGUF of your choice
  2. Load it in LM Studio
  3. Use the ComfyUI node or chat directly

⚠️ Disclaimer

This model generates text for image prompts. While I have filtered the dataset to the best of my ability, users should exercise their own judgment. I am not responsible for the content you generate.


πŸ™ Acknowledgements

  • LiquidAI for the excellent LFM2.5 architecture
  • Qwen Team for the VL model used in dataset creation
  • The open source AI community for making this kind of work possible

Built with ❀️ and liquid courage by BennyDaBall

Author: BennyDaBall

Likes: 2

Downloads: 0

Tags: transformers, safetensors, gguf, text-generation, prompt-engineering, image-generation, lfm, liquid, en, license:apache-2.0, endpoints_compatible, region:us, conversational

salakash/Minimalism


  β€’ language: en
  β€’ license: apache-2.0
  β€’ base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
  β€’ tags: code, coding-assistant, lora, mlx, apple-silicon, qwen2.5
  β€’ datasets: flwrlabs/code-alpaca-20k, m-a-p/Code-Feedback
  β€’ library_name: mlx-lm
  β€’ pipeline_tag: text-generation

Developed by Samiya Kashif, Kashif Salahuddin, Rohan Bhangale & Robert Rojek

1. Executive Summary

Minimalism is a specialized coding assistant built as a LoRA (Low-Rank Adaptation) adapter for the Qwen2.5-Coder-0.5B-Instruct base model. Unlike generic coding assistants, Minimalism implements a "runnable-first" philosophy: when users request code, responses are structured with clear Solution, Usage, and Sanity test sections, ensuring developers receive immediately executable code with minimal friction.

What Minimalism Is

  • A LoRA adapter Trained on code-alpaca-20k dataset
  • OpenAI-compatible API for local inference
  • Lightweight distribution (~12MB adapter vs. multi-GB full models)
  • Production-engineered with automated pipelines, evaluation, and publishing

Why Minimalism

Minimalism is built for a simple, practical goal: deliver the same outcome with fewer lines of code.

Most coding assistants tend to "over-achieve" by producing large, multi-step solutions, even when a smaller, clearer implementation would do. That extra code isn't free: it increases review effort, maintenance cost, and the surface area where defects can hide.

Too Much Code, Too Fast

Teams everywhere are seeing a huge jump in the number of lines of code (LOC). Developers, from interns to seniors, are suddenly writing 5 to 7 times more than before. At first, it looks like higher productivity. In reality, it often means more bugs.

There's a long-standing rule in software engineering:

"The more lines of code you have, the higher your probability of introducing bugs."

The industry's oldest truth still stands: the more code you have, the more things can go wrong. And AI-generated code tends to be verbose and repetitive, which can inflate LOC without adding real value.

Minimalism is designed for teams that value minimalism, clarity, and correctness over volume.

What makes Minimalism different

  • Minimal LoC by default Minimalism is optimized to minimize lines of code while preserving behaviorβ€”it prefers the smallest correct solution that meets the user’s objective.

  • Internal governance behavior The model follows a lightweight internal β€œgovernance layer” in its response style: avoid unnecessary scaffolding, avoid over-abstraction, keep code focused, and don’t introduce additional complexity that doesn’t improve the result. The governance layer sits between the user request and the model’s final output to enforce minimalism as a constraint. It evaluates candidate solutions by measuring lines of code and selects the smallest implementation that still satisfies the original requirements. If a shorter variant fails, it automatically falls back to the next-smallest passing candidate, ensuring fewer lines without sacrificing correctness.

  • Practical, runnable output When you ask for code, Minimalism is tuned toward β€œrunnable-first” answersβ€”clear implementation, a minimal usage example, and a quick sanity check when appropriate.

Early validation

Minimalism was evaluated in a small developer study comparing it with popular coding models on a shared set of tasks. In this pilot, Minimalism showed a clear reduction in lines of code (up to ~30%) while producing solutions that executed correctly and achieved the same intended outcomes under the evaluation harness.

Note: Results depend on task selection, constraints, and how "equivalence" is measured. We recommend validating on your own codebase and standards.

Why It Exists

Developers need coding assistance that:

  1. Provides runnable code immediately without extensive explanation
  2. Runs locally without cloud dependencies
  3. Maintains small footprint for fast iteration
  4. Offers structured, predictable responses for automation

Who It's For

  • Individual developers working on their individual projects.
  • Small teams needing local, private coding assistance
  • Educators teaching programming with consistent code examples
  • Researchers experimenting with LoRA fine-tuning on MLX

System Architecture

High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Minimalism System                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚   Data       β”‚      β”‚   Training   β”‚      β”‚  Serving β”‚    β”‚
β”‚  β”‚  Pipeline    │─────▢│   Pipeline   │─────▢│  Layer   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚         β”‚                     β”‚                   β”‚          β”‚
β”‚         β–Ό                     β–Ό                   β–Ό          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  Dataset     β”‚      β”‚  LoRA        β”‚      β”‚  MLX     β”‚    β”‚
β”‚  β”‚  Processing  β”‚      β”‚  Adapter     β”‚      β”‚  Server  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Technical Architecture

Method 1 Pipeline (9 Steps)

1. Receive Request
   ↓
2. Derive Requirements + Tests
   ↓
3. Generate N Candidates
   ↓
4. Normalize Code
   ↓
5. Score by LoC
   ↓
6. Apply Quality Gates (G1-G5)
   ↓
7. Select Minimal Passing
   ↓
8. Optional Reduction Loop
   ↓
9. Output + Audit

Quality Gates

  • G1 Compile: Python syntax validation
  • G2 Constraints: Dependency checking
  • G3 Execution: Sandbox smoke test (2s timeout)
  • G4 Tests: Acceptance test validation
  • G5 Safety: Dangerous operation detection


Key Design Principles

  1. Text-based analysis (no AST parsing required)
  2. Fail-fast validation (stop on first gate failure)
  3. Sandbox isolation (subprocess with timeout)
  4. Complete audit trail (every decision logged)
  5. Pluggable architecture (easy to extend)
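
A minimal sketch of the LoC-scored selection loop implied by the pipeline above, showing only the G1 compile gate and the G3 sandbox smoke test; function names and details are illustrative, not the project's actual code:

import os
import subprocess
import sys
import tempfile

def passes_gates(code: str, timeout_s: float = 2.0) -> bool:
    """G1: the candidate must compile. G3: it must run cleanly in a subprocess within the timeout."""
    try:
        compile(code, "<candidate>", "exec")                      # G1 Compile
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:                                                           # G3 Execution (sandbox smoke test)
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def select_minimal(candidates):
    """Score candidates by lines of code and return the smallest one that passes the gates."""
    for code in sorted(candidates, key=lambda c: len(c.strip().splitlines())):
        if passes_gates(code):
            return code
    return None  # nothing passed; the caller falls back or regenerates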

Quick Start

Option 1: Use with MLX

Install MLX and load the model with adapter:

# Install first: pip install mlx-lm
from mlx_lm import load, generate

# Load base model with Minimalism adapter
model, tokenizer = load(
    "mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit",
    adapter_path="salakash/Minimalism"
)

# Generate code
prompt = "Write a Python function to calculate factorial"
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

Option 2: Use with Transformers

# Install first: pip install transformers torch peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "salakash/Minimalism")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Generate
messages = [{"role": "user", "content": "Write a Python function to add two numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 3: Web UI with MLX

Start an OpenAI-compatible server:

# Install mlx-lm if not already installed
pip install mlx-lm

# Start server with adapter
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit \
  --adapter-path salakash/Minimalism \
  --port 8080

Then use with any OpenAI-compatible client:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    "max_tokens": 512
  }'

Or use with any OpenAI-compatible web UI like:

Configure the UI to point to http://localhost:8080 as the API endpoint.

Option 4: Hugging Face Inference API

Use directly via Hugging Face's Inference API (requires HF token):

import requests

API_URL = "https://api-inference.huggingface.co/models/salakash/Minimalism"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Write a Python function to check if a number is prime",
    "parameters": {"max_new_tokens": 256}
})
print(output)

Response Format

Minimalism provides structured, runnable-first responses:

  • Solution: The main implementation code
  • Usage: A minimal runnable example
  • Sanity test: A tiny test snippet (when appropriate)

Comparison

Minimalism achieved the same objective in ~8-10 lines of code, while a standard LLM typically produced 22-26 lines for the equivalent solution.

Minimalism

[Screenshot: Minimalism's solution (about 8-10 lines)]

Standard Coding Agent

[Screenshot: the standard coding agent's solution (about 22-26 lines)]

Documentation

For comprehensive technical details, see:

Base Model & Dataset

License

This project publishes only adapter artifacts and configuration. The base model and dataset have their own licenses:

  • Base Model: Apache-2.0 (Qwen/Qwen2.5-Coder-0.5B-Instruct)
  • Dataset: Apache-2.0 (flwrlabs/code-alpaca-20k)

See LICENSE-THIRD-PARTY.md for complete attribution.

Acknowledgments

  • Qwen team for the excellent base model
  • MLX community for the Apple Silicon optimizations
  • flwrlabs for the code-alpaca-20k dataset
  • Multimodel Art Projection for m-a-p/Code-Feedback

Author: salakash

Likes: 2

Downloads: 0

Tags: mlx-lm, qwen2, code, coding-assistant, lora, mlx, apple-silicon, qwen2.5, text-generation, en, dataset:flwrlabs/code-alpaca-20k, dataset:m-a-p/Code-Feedback, base_model:Qwen/Qwen2.5-Coder-0.5B-Instruct, base_model:adapter:Qwen/Qwen2.5-Coder-0.5B-Instruct, license:apache-2.0, region:us

TeichAI/Devstral-Small-2505-Deepseek-V3.2-Speciale-Distill-GGUF


  β€’ base_model: TeichAI/Devstral-Small-2505-Deepseek-V3.2-Speciale-Distill
  β€’ tags: llama.cpp, gguf, unsloth
  β€’ language: en
  β€’ datasets: TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k, TeichAI/deepseek-v3.2-speciale-1000x, TeichAI/deepseek-v3.2-speciale-openr1-math-3k

Devstral Small 2505 - Deepseek v3.2 Speciale Distill

This model was trained on a non-reasoning dataset (reasoning traces were removed) distilled from DeepSeek v3.2 Speciale.

  • 🧬 Datasets:

    • TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k
    • TeichAI/deepseek-v3.2-speciale-1000x
    • TeichAI/deepseek-v3.2-speciale-openr1-math-3k
  • πŸ— Base Model:

    • unsloth/Devstral-Small-2505
  • ⚑ Use cases:

    • Coding
    • Math
    • Chat
    • Deep Research

Author: TeichAI

Likes: 2

Downloads: 0

Tags: gguf, llama.cpp, unsloth, en, dataset:TeichAI/deepseek-v3.2-speciale-OpenCodeReasoning-3k, dataset:TeichAI/deepseek-v3.2-speciale-1000x, dataset:TeichAI/deepseek-v3.2-speciale-openr1-math-3k, base_model:TeichAI/Devstral-Small-2505-Deepseek-V3.2-Speciale-Distill, base_model:quantized:TeichAI/Devstral-Small-2505-Deepseek-V3.2-Speciale-Distill, endpoints_compatible, region:us, conversational

AliceThirty/Step-3.5-Flash-gguf


  β€’ tags: gguf
  β€’ base_model: stepfun-ai/Step-3.5-Flash

Author: AliceThirty

Likes: 2

Downloads: 0

Tags: gguf, base_model:stepfun-ai/Step-3.5-Flash, base_model:quantized:stepfun-ai/Step-3.5-Flash, endpoints_compatible, region:us, conversational