Today's AI Summary

AI Developments: Prompt Engineering, View Synthesis, and 3D Generation Take Center Stage

Today's AI landscape showcases advancements across several key areas, including prompt engineering for image generation, monocular view synthesis, and reinforcement learning for 3D content creation. New models and research papers introduce innovative approaches to improve the quality, control, and efficiency of AI-driven tasks.

Research Highlights

Several interesting research papers have emerged:

  • SceneMaker: This paper introduces a decoupled 3D scene generation framework that addresses challenges in open-set de-occlusion and pose estimation. By decoupling the de-occlusion model and enhancing it with diverse datasets, SceneMaker achieves higher quality geometry and more accurate poses in complex scenes. The project releases code and datasets for further research.
  • Sharp Monocular View Synthesis in Less Than a Second: This paper introduces SHARP, an approach to photorealistic view synthesis from a single image. SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene in less than a second on a standard GPU via a single feedforward pass through a neural network. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets.
  • AlcheMinT: This paper presents a framework for fine-grained temporal control in subject-driven video generation. AlcheMinT introduces explicit timestamps conditioning, enabling precise control over subject appearance and disappearance in videos, which is crucial for applications like storyboarding and controllable animation.
  • OmniView: This paper introduces a unified framework that generalizes across a wide range of 4D consistency tasks. OmniView separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control.
  • Mull-Tokens: This paper introduces modality-agnostic latent tokens, pre-trained to hold intermediate information in either the image or text modality, that let the model reason free-form toward the correct answer. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, Mull-Tokens improve upon several baselines that use text-only reasoning or interleaved image-text reasoning.
  • Are We Ready for RL in Text-to-3D Generation?: This paper conducts a systematic study of applying reinforcement learning (RL) to text-to-3D autoregressive generation. It explores reward designs, RL algorithms, and introduces a new benchmark (MME-3DR) to measure reasoning abilities in 3D generation models. The research culminates in AR3D-R1, an RL-enhanced text-to-3D model.
  • ImplicitRDP: This paper proposes a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. It introduces Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks.
  • Stronger Normalization-Free Transformers: This paper introduces Derf, a novel point-wise function that outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling.
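To make the last item concrete, here is a minimal sketch of a point-wise, normalization-free layer in the spirit of DyT-style replacements. The erf-based elementwise transform and the parameter shapes are illustrative assumptions; the actual Derf definition is given in the paper.

# Hedged sketch: a learnable point-wise stand-in for LayerNorm/RMSNorm.
# erf(alpha * x) is an illustrative elementwise function, not the published Derf.
import torch
import torch.nn as nn

class PointwiseErf(nn.Module):
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # per-channel slope
        self.gamma = nn.Parameter(torch.ones(dim))                 # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))                 # learnable shift

    def forward(self, x):
        # Purely element-wise: no mean or variance statistics are computed.
        return self.gamma * torch.erf(self.alpha * x) + self.beta

# Drop-in for nn.LayerNorm(768) inside a transformer block.
block_norm = PointwiseErf(dim=768)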

Model Releases

Several new models have been released, showcasing diverse applications:

  • BennyDaBall/qwen3-4b-Z-Image-Engineer: This model focuses on automated prompt engineering for Z-Image Turbo, an architecture for image generation. It's fine-tuned to understand the specific requirements of Z-Image Turbo, generating detailed and effective prompts for high-quality image synthesis.
  • apple/Sharp: This model accompanies the research paper Sharp Monocular View Synthesis in Less Than a Second. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network.
  • allenai/Olmo-3.1-7B-RL-Zero-Math: This model is part of the Olmo 3 family, focusing on improving reasoning tasks like math. It's trained using reinforcement learning from verifiable rewards (RLVR) on a math-specific dataset.
  • DrGM/DrGM-ConvNeXt-V2L-Facial-Emotion-Recognition: This model is a facial emotion recognition model based on the ConvNeXt V2 Large architecture. It achieves high accuracy in recognizing seven distinct facial emotions.
  • Lakonik/pi-FLUX.2: This model

AI Papers for 2026-02-08

Shared LoRA Subspaces for almost Strict Continual Learning

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration without relying on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual fine-tuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
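As a rough illustration of the shared-subspace idea, the sketch below keeps one low-rank adapter and folds each new task's update into it by truncated SVD. The merge rule and shapes are assumptions for illustration, not the paper's exact Share update.

# Hedged sketch: one shared low-rank subspace absorbing successive task updates.
# The SVD-truncation merge is an illustrative assumption, not Share's exact rule.
import torch

class SharedLowRankSubspace:
    """A single low-rank adapter W ≈ U @ V reused across all tasks."""
    def __init__(self, d_out, d_in, rank):
        self.rank = rank
        self.U = torch.zeros(d_out, rank)   # shared left basis
        self.V = torch.zeros(rank, d_in)    # shared right basis

    def integrate_task(self, task_delta):
        # task_delta: (d_out, d_in) low-rank update learned on the new task.
        # Merge it into the current subspace and keep only the top-`rank`
        # singular directions, so one adapter carries all tasks seen so far.
        combined = self.U @ self.V + task_delta
        U, S, Vh = torch.linalg.svd(combined, full_matrices=False)
        self.U = U[:, :self.rank] * S[:self.rank]
        self.V = Vh[:self.rank, :]

adapter = SharedLowRankSubspace(d_out=768, d_in=768, rank=8)
adapter.integrate_task(torch.randn(768, 768) * 0.01)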

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
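The round-wise routing can be pictured as embedding each agent's "need" and "offer" descriptors and keeping only the best-matching directed edges. The sketch below uses cosine similarity and a top-1 rule as illustrative assumptions; the paper's matcher and thresholds may differ.

# Hedged sketch of descriptor-based routing: a directed edge j -> i is kept
# when agent j's offer best matches agent i's need. The embedder and top-k
# rule are assumptions, not DyTopo's exact matching procedure.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def route(needs, offers, embed, top_k=1):
    # needs[i], offers[j]: natural-language descriptors; embed: text -> np.ndarray
    need_vecs = [embed(n) for n in needs]
    offer_vecs = [embed(o) for o in offers]
    edges = []
    for i, nv in enumerate(need_vecs):
        scores = [(cosine(ov, nv), j) for j, ov in enumerate(offer_vecs) if j != i]
        for _, j in sorted(scores, reverse=True)[:top_k]:
            edges.append((j, i))   # private message flows from offerer j to needer i
    return edges                   # sparse directed graph, rebuilt every round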

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
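The calibration step can be illustrated with standard split conformal prediction: compute a score threshold on held-out calibration examples and transmit only content that falls under it. The scoring function and coverage level here are assumptions; CommCP's exact procedure is described in the paper.

# Hedged sketch of split conformal calibration for filtering generated messages.
# `score` is a nonconformity score (higher = less reliable); the quantile rule
# is generic split conformal prediction, not CommCP's specific recipe.
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    n = len(calibration_scores)
    # Finite-sample corrected quantile targeting coverage >= 1 - alpha.
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(calibration_scores, min(q, 1.0)))

def filter_message(candidate_facts, score, threshold):
    # Keep only facts whose nonconformity score lies within the calibrated set.
    return [f for f in candidate_facts if score(f) <= threshold]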

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
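A minimal picture of the router is a small policy that maps a query embedding to one of the three tiers per memory module. The module names, embedding size, and untrained linear heads below are illustrative; in the paper the router is a compact neural policy trained with reinforcement learning.

# Hedged sketch of query-aware budget-tier routing. Module names and sizes
# are hypothetical; BudgetMem trains the router with RL rather than using
# untrained heads like these.
import torch
import torch.nn as nn

TIERS = ["low", "mid", "high"]
MODULES = ["retrieve", "summarize", "reason"]   # hypothetical module set

class TierRouter(nn.Module):
    def __init__(self, d_query=384):
        super().__init__()
        self.heads = nn.ModuleDict({m: nn.Linear(d_query, len(TIERS)) for m in MODULES})

    def forward(self, query_emb):
        # One tier decision per module, conditioned on the same query embedding.
        return {m: TIERS[head(query_emb).argmax(-1)] for m, head in self.heads.items()}

router = TierRouter()
print(router(torch.randn(384)))   # e.g. {'retrieve': 'mid', 'summarize': 'low', ...}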

Learning Event-Based Shooter Models from Virtual Reality Experiments

Virtual reality (VR) has emerged as a powerful tool for evaluating school security measures in high-risk scenarios such as school shootings, offering experimental control and high behavioral fidelity. However, assessing new interventions in VR requires recruiting new participant cohorts for each condition, making large-scale or iterative evaluation difficult. These limitations are especially restrictive when attempting to learn effective intervention strategies, which typically require many training episodes. To address this challenge, we develop a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies. We use the simulator to examine the impact of a robot-based shooter intervention strategy. Once shown to reproduce key empirical patterns, the DES enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects. Overall, this work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
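The probing idea can be sketched as a small MLP with weight decay trained on residual-stream activations to predict whether the sampled answer was correct. The layer choice, hidden width, and hyperparameters below are assumptions, not CORAL's reported configuration.

# Hedged sketch of a correctness probe over residual activations: a small MLP
# regularized with weight decay, trained to predict answer correctness.
import torch
import torch.nn as nn

probe = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-2)  # regularization
loss_fn = nn.BCEWithLogitsLoss()

def train_step(hidden_states, correct_labels):
    # hidden_states: (batch, 4096) activations from a chosen layer/token position
    # correct_labels: (batch,) 1.0 if the model's answer was correct, else 0.0
    opt.zero_grad()
    loss = loss_fn(probe(hidden_states).squeeze(-1), correct_labels)
    loss.backward()
    opt.step()
    return loss.item()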

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify optimism as a key mechanism for restoring stability, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS (Halder et al., 2025) is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by Halder et al. (2025) by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit bonus to the posterior mean, and we establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
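The two optimistic variants can be sketched for a K-armed Gaussian bandit with unit noise: one inflates the posterior variance before sampling, the other keeps the variance and adds an explicit bonus to the posterior mean. The inflation factor and bonus schedule are illustrative choices, not the constants analyzed in the paper.

# Hedged sketch of optimism-stabilized Thompson sampling for Gaussian bandits.
# Variant "inflate" samples from a widened posterior; variant "bonus" adds an
# explicit optimism term to the posterior mean. Constants are illustrative.
import numpy as np

def select_arm(counts, means, t, variant="inflate", c=2.0):
    counts = np.maximum(counts, 1)
    std = np.sqrt(1.0 / counts)                      # posterior std under unit noise
    if variant == "inflate":
        samples = np.random.normal(means, c * std)   # variance-inflated TS
    else:
        bonus = c * np.sqrt(np.log(t + 1) / counts)  # mean-bonus optimism
        samples = np.random.normal(means + bonus, std)
    return int(np.argmax(samples))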

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
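The pairwise protocol ultimately needs an aggregation step that turns judge verdicts into a ranking. The sketch below uses a standard Bradley-Terry fit as one such aggregator; GenArena's own aggregation may differ.

# Hedged sketch: aggregate pairwise judge outcomes into per-model scores with
# a simple Bradley-Terry MM fit. This illustrates the pairwise paradigm, not
# GenArena's exact scoring procedure.
import numpy as np

def bradley_terry(wins, iters=200):
    # wins[i, j] = number of times generator i beat generator j under the judge
    n = wins.shape[0]
    scores = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j] + 1e-12)
                      for j in range(n) if j != i)
            scores[i] = num / max(den, 1e-12)
        scores /= scores.sum()
    return scores   # higher = stronger generator under pairwise judging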

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.

Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
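Attention-based pooling of this kind can be sketched as a learned query attending over the frame sequence from a Whisper encoder layer to produce one utterance vector for the emotion classifier. The dimensions and single learned query below are assumptions that approximate, rather than reproduce, the paper's Multi-head Attentive Average Pooling and QKV Pooling.

# Hedged sketch of attentive pooling over Whisper encoder frames for SER.
# A single learned query attends over the frames; d_model=384 matches Whisper
# Tiny, but the exact pooling design in the paper differs.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, d_model=384, n_heads=4, n_classes=7):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, frames):
        # frames: (batch, time, d_model) hidden states from a Whisper encoder layer
        q = self.query.expand(frames.size(0), -1, -1)
        pooled, _ = self.attn(q, frames, frames)      # (batch, 1, d_model)
        return self.classifier(pooled.squeeze(1))     # (batch, n_classes) emotion logits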

AI Models

huihui-ai/Huihui-Qwen3-Coder-Next-abliterated


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
  • Qwen/Qwen3-Coder-Next
tags:
  • abliterated
  • uncensored

huihui-ai/Huihui-Qwen3-Coder-Next-abliterated

This is an uncensored version of Qwen/Qwen3-Coder-Next created with abliteration (see remove-refusals-with-transformers to learn more about it). It is a crude, proof-of-concept implementation that removes refusals from an LLM without using TransformerLens.

ollama

Please use the latest version of ollama (0.15.5 or newer).

You can use huihui_ai/qwen3-coder-next-abliterated directly,

ollama run huihui_ai/qwen3-coder-next-abliterated

Usage

You can use this model in your applications by loading it with Hugging Face's transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
import torch
import os
import signal
import time

if (
    "PYTORCH_ALLOC_CONF" not in os.environ
    and "PYTORCH_CUDA_ALLOC_CONF" not in os.environ
):
    print(f"PYTORCH_ALLOC_CONF.")
    os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

cpu_count = os.cpu_count()
print(f"Number of CPU cores in the system: {cpu_count}")
half_cpu_count = cpu_count // 2
os.environ["MKL_NUM_THREADS"] = str(half_cpu_count)
os.environ["OMP_NUM_THREADS"] = str(half_cpu_count)
torch.set_num_threads(half_cpu_count)

print(f"PyTorch threads: {torch.get_num_threads()}")
print(f"MKL threads: {os.getenv('MKL_NUM_THREADS')}")
print(f"OMP threads: {os.getenv('OMP_NUM_THREADS')}")

# Load the model and tokenizer
MODEL_ID = "huihui-ai/Huihui-Qwen3-Coder-Next-abliterated"

print(f"Load Model {MODEL_ID} ... ")
quant_config_4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
    quantization_config=quant_config_4,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

messages = []
skip_prompt=True
skip_special_tokens=True

class CustomTextStreamer(TextStreamer):
    def __init__(self, tokenizer, skip_prompt=True, skip_special_tokens=True):
        super().__init__(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)
        self.generated_text = ""
        self.stop_flag = False
        self.init_time = time.time()  # Record initialization time
        self.end_time = None  # To store end time
        self.first_token_time = None  # To store first token generation time
        self.token_count = 0  # To track total tokens

    def on_finalized_text(self, text: str, stream_end: bool = False):
        if self.first_token_time is None and text.strip():  # Set first token time on first non-empty text
            self.first_token_time = time.time()
        self.generated_text += text

        self.token_count += 1

        print(text, end="", flush=True)
        if stream_end:
            self.end_time = time.time()  # Record end time when streaming ends
        if self.stop_flag:
            raise StopIteration

    def stop_generation(self):
        self.stop_flag = True
        self.end_time = time.time()  # Record end time when generation is stopped

    def get_metrics(self):
        """Returns initialization time, first token time, first token latency, end time, total time, total tokens, and tokens per second."""
        if self.end_time is None:
            self.end_time = time.time()  # Set end time if not already set
        total_time = self.end_time - self.init_time  # Total time from init to end
        tokens_per_second = self.token_count / total_time if total_time > 0 else 0
        first_token_latency = (self.first_token_time - self.init_time) if self.first_token_time is not None else None
        metrics = {
            "init_time": self.init_time,
            "first_token_time": self.first_token_time,
            "first_token_latency": first_token_latency,
            "end_time": self.end_time,
            "total_time": total_time,  # Total time in seconds
            "total_tokens": self.token_count,
            "tokens_per_second": tokens_per_second
        }
        return metrics

def generate_stream(model, tokenizer, messages, skip_prompt, skip_special_tokens, max_new_tokens):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer(
        [text],
        return_tensors="pt",
    ).to(model.device)

    streamer = CustomTextStreamer(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)

    def signal_handler(sig, frame):
        streamer.stop_generation()
        print("\n[Generation stopped by user with Ctrl+C]")

    signal.signal(signal.SIGINT, signal_handler)

    print("Response: ", end="", flush=True)
    try:
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens = max_new_tokens,
            streamer=streamer,
        )
        del generated_ids
    except StopIteration:
        print("\n[Stopped by user]")

    del model_inputs
    torch.cuda.empty_cache()
    signal.signal(signal.SIGINT, signal.SIG_DFL)

    return streamer.generated_text, streamer.stop_flag, streamer.get_metrics()


while True:
    print(f"skip_prompt: {skip_prompt}")
    print(f"skip_special_tokens: {skip_special_tokens}")

    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = []
        print("Chat history cleared. Starting a new conversation.")
        continue
    if user_input.lower() == "/skip_prompt":
        skip_prompt = not skip_prompt
        continue
    if user_input.lower() == "/skip_special_tokens":
        skip_special_tokens = not skip_special_tokens
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue


    messages.append({
        "role": "user",
        "content": user_input
    })

    response, stop_flag, metrics = generate_stream(model, tokenizer, messages, skip_prompt, skip_special_tokens, 40960)
    print("\n\nMetrics:")
    for key, value in metrics.items():
        print(f"  {key}: {value}")


    print("", flush=True)
    if stop_flag:
        continue
    messages.append({
        "role": "assistant",
        "content": response.strip()
    })

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue development and improvement; even a cup of coffee makes a difference.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 4

Downloads: 0

Tags: transformers, safetensors, qwen3_next, text-generation, abliterated, uncensored, conversational, base_model:Qwen/Qwen3-Coder-Next, base_model:finetune:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us

AesSedai/Step-3.5-Flash-GGUF


base_model:

  • stepfun-ai/Step-3.5-Flash

MoE-style quants of Step-3.5-Flash:

  • conditional experts quantized to varying degrees (IQ3_S, IQ4_S, Q4_K, Q5_K, ...)
  • rest of the model (attention, etc.) remains in Q8_0.

WIP, more quants + KLD measurements to come.

Author: AesSedai

Likes: 4

Downloads: 0

Tags: gguf, base_model:stepfun-ai/Step-3.5-Flash, base_model:quantized:stepfun-ai/Step-3.5-Flash, endpoints_compatible, region:us, imatrix, conversational

FranciscoPetitti/Borel-Stability-Analyticity-Limit


license: cc-by-4.0
language:
  • en
tags:
  • mathematics
  • dynamical-systems

Borel Sensitivity and the Analyticity Limit: A New Foundation for Stability in Discrete-Time Dynamics

Author: Juan Francisco Petitti (Independent Researcher)
Date: February 7, 2026
ORCID: 0009-0008-9427-880X
License: CC-BY-4.0


Abstract

The faithful representation of continuous dynamical systems via discrete-time integrators is fundamentally limited by the emergence of artifactual bifurcations, instabilities induced by the discretization step $h$ that are absent in the original flow. While Backward Error Analysis (BEA) provides a framework for studying these effects through modified vector fields, the resulting formal power series typically exhibits a zero radius of convergence, obscuring the transition from stability to chaos. In this paper, we resolve the analyticity limit of the BEA by lifting the problem into the Borel plane. We demonstrate that the discrete evolution operator can be represented as a resurgent function whose singularities in the Borel plane $\zeta$ correspond to the reciprocal of the system's Lyapunov exponents. We introduce a novel Borel Sensitivity Operator $\nabla_\mu\hat{\mathcal{L}}(\zeta)$, derived via the Duhamel formula, to quantify how variations in system parameters $\mu$ shift these singularities toward the origin. By analyzing the spectral flow of the operator $\mathcal{L} = (\mu x - x^3)\,\partial_x$, we provide a precise criterion for the critical step size $h_c$, identifying it as the point where the integration contour in the Laplace-Borel transform encounters a spectral singularity. Our results establish a theoretical foundation for "Spectral Step-Control" in AI-driven simulations, ensuring topological consistency between continuous models and their discrete implementations.

Key Contributions

  • Resolution of the Divergent BEA: Transition from local Taylor expansions to global spectral representations in the Borel plane.
  • Borel Sensitivity Operator: A closed-form operator-theoretic tool to predict topological collapses in discrete flows.
  • 0.05% Predictive Accuracy: Empirical validation using the ẋ = μx - x³ system, mapping the deformation of the analyticity horizon.
  • Strategic AI Implications: Foundation for Certifiable AI and Topological Reliability in Neural ODEs.

Visualizing the Analyticity Limit

The paper demonstrates that the birth of spurious attractors (numerical chaos) corresponds to a Stokes Phenomenon, where the Laplace integration contour encounters a spectral singularity ζₛ.

Files

  • A_New_Foundation_for_Stability.pdf: Full technical manuscript.
  • (Soon): Code implementation for the Sensitivity Operator.

Citation

If you use this theory or the Borel Sensitivity Operator in your research, please cite:

@article{petitti2026borel,
  title={Borel Sensitivity and the Analyticity Limit},
  author={Petitti, Juan Francisco},
  year={2026},
  journal={Hugging Face Papers / Zenodo},
  doi={10.5281/zenodo.18520658}
}

Author: FranciscoPetitti

Likes: 2

Downloads: 0

Tags: mathematics, dynamical-systems, en, license:cc-by-4.0, region:us

mradermacher/DASD-4B-Thinking-2507-GRPO-GGUF


base_model: Jackrong/DASD-4B-Thinking-2507-GRPO
language:
  • en
library_name: transformers
license: apache-2.0
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • text-generation-inference
  • transformers
  • unsloth
  • qwen3

About


static quants of https://huggingface.co/Jackrong/DASD-4B-Thinking-2507-GRPO


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 1.8 | |
| GGUF | Q3_K_S | 2.0 | |
| GGUF | Q3_K_M | 2.2 | lower quality |
| GGUF | Q3_K_L | 2.3 | |
| GGUF | IQ4_XS | 2.4 | |
| GGUF | Q4_K_S | 2.5 | fast, recommended |
| GGUF | Q4_K_M | 2.6 | fast, recommended |
| GGUF | Q5_K_S | 2.9 | |
| GGUF | Q5_K_M | 3.0 | |
| GGUF | Q6_K | 3.4 | very good quality |
| GGUF | Q8_0 | 4.4 | fast, best quality |
| GGUF | f16 | 8.2 | 16 bpw, overkill |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, text-generation-inference, unsloth, qwen3, en, base_model:Jackrong/DASD-4B-Thinking-2507-GRPO, base_model:quantized:Jackrong/DASD-4B-Thinking-2507-GRPO, license:apache-2.0, endpoints_compatible, region:us, conversational

mradermacher/Gemma-3-Prompt-Coder-270m-it-Uncensored-GGUF


base_model: gss1147/Gemma-3-Prompt-Coder-270m-it-Uncensored
datasets:
  • microsoft/rStar-Coder
  • gokaygokay/prompt-enhancement-75k
  • gokaygokay/prompt-enhancer-dataset
language:
  • en
library_name: transformers
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • mergekit
  • merge

About


static quants of https://huggingface.co/gss1147/Gemma-3-Prompt-Coder-270m-it-Uncensored


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q3_K_S | 0.3 | |
| GGUF | Q2_K | 0.3 | |
| GGUF | IQ4_XS | 0.3 | |
| GGUF | Q3_K_M | 0.3 | lower quality |
| GGUF | Q3_K_L | 0.3 | |
| GGUF | Q4_K_S | 0.3 | fast, recommended |
| GGUF | Q4_K_M | 0.4 | fast, recommended |
| GGUF | Q5_K_S | 0.4 | |
| GGUF | Q5_K_M | 0.4 | |
| GGUF | Q6_K | 0.4 | very good quality |
| GGUF | Q8_0 | 0.4 | fast, best quality |
| GGUF | f16 | 0.6 | 16 bpw, overkill |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, mergekit, merge, en, dataset:microsoft/rStar-Coder, dataset:gokaygokay/prompt-enhancement-75k, dataset:gokaygokay/prompt-enhancer-dataset, base_model:gss1147/Gemma-3-Prompt-Coder-270m-it-Uncensored, base_model:quantized:gss1147/Gemma-3-Prompt-Coder-270m-it-Uncensored, endpoints_compatible, region:us

Ex0bit/Step-3.5-Flash-PRISM


license: other
license_name: prism-research
license_link: LICENSE.md
language:
  • en
  • zh
tags:
  • stepfun
  • prism
  • moe
  • reasoning
  • coding
  • agentic
  • abliterated
pipeline_tag: text-generation
library_name: transformers
base_model:
  • stepfun-ai/Step-3.5-Flash
base_model_relation: finetune



Step-3.5-Flash-PRISM

A "role-play" following unrestricted/unchained PRISM-LITE version of StepFun's Step 3.5 Flash intended particularly for over-refusal and propaganda mechanisms suppression using our SOTA PRISM pipeline.

For Full Custom Production PRISM versions & tensors reach out.


☕ Support Our Work

If you enjoy our work and find it useful, please consider sponsoring or supporting us!

Ko-fi

| Option | Description |
|--------|-------------|
| PRISM VIP Membership | Access to all PRISM models |
| Bitcoin | bc1qarq2pyn4psjpcxzp2ghgwaq6y2h4e53q232x8r |


Model Highlights

  • PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
  • 196B MoE Architecture — 196 billion total parameters with only 11 billion active per token across 288 fine-grained routed experts + 1 shared expert
  • Multi-Token Prediction (MTP-3) — Predicts 4 tokens simultaneously, achieving 100–300 tok/s typical throughput (peaking at 350 tok/s)
  • 256K Context Window — Cost-efficient long context via 3:1 Sliding Window Attention (SWA) ratio
  • Frontier Reasoning & Coding — 97.3 on AIME 2025, 74.4% on SWE-bench Verified, 51.0% on Terminal-Bench 2.0
  • Accessible Local Deployment — Runs on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark)

Model Architecture

| Specification | Value |
|---------------|-------|
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Backbone | 45-layer Transformer (4,096 hidden dim) |
| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
| Activated Parameters | ~11B (per token) |
| Routed Experts per Layer | 288 |
| Shared Experts | 1 (always active) |
| Selected Experts per Token | Top-8 |
| Vocabulary Size | 128,896 |
| Context Length | 256K |
| Attention | Hybrid SWA (3:1 SWA-to-Full ratio) |
| MTP Head | Sliding-window attention + dense FFN (4 tokens/pass) |

Benchmarks

| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 | GLM-4.7 | MiniMax M2.1 |
|-----------|---------------|---------------|-----------|---------|--------------|
| Agent | | | | | |
| τ²-Bench | 88.2 | 80.3 | 85.4 | 87.4 | 86.6 |
| BrowseComp | 51.6 | 51.4 | 60.6 | 52.0 | 47.4 |
| GAIA (no file) | 84.5 | 75.1 | 75.9 | 61.9 | 64.3 |
| xbench-DeepSearch (2025.05) | 83.7 | 78.0 | 76.7 | 72.0 | 68.7 |
| Reasoning | | | | | |
| AIME 2025 | 97.3 | 93.1 | 96.1 | 95.7 | 83.0 |
| HMMT 2025 (Feb.) | 98.4 | 92.5 | 95.4 | 97.1 | 71.0 |
| IMOAnswerBench | 85.4 | 78.3 | 81.8 | 82.0 | 60.4 |
| Coding | | | | | |
| LiveCodeBench-V6 | 86.4 | 83.3 | 85.0 | 84.9 | — |
| SWE-bench Verified | 74.4 | 73.1 | 76.8 | 73.8 | 74.0 |
| Terminal-Bench 2.0 | 51.0 | 46.4 | 50.8 | 41.0 | 47.9 |

llama.cpp (GGUF)

For local deployment (requires ~120 GB of VRAM for int4; smaller quants are available):

./llama-cli -m step3.5_flash_prism_Q4_K_S.gguf --jinja

Recommended Parameters

| Use Case | Temperature | Top-P | Max New Tokens |
|----------|-------------|-------|----------------|
| Reasoning / Coding | 1.0 | 0.95 | 32768 |
| General Chat | 0.6 | 0.95 | 4096 |

Hardware Requirements

| Setup | Details |
|-------|---------|
| BF16 (Full) | 8x H100/A100 80GB with tensor parallelism |
| FP8 Quantized | 8x A100 80GB with expert parallelism |
| GGUF INT4 (Local) | ~120 GB unified memory (Mac Studio M4 Max 128GB, DGX Spark, AMD Ryzen AI Max+ 395) |

License

This model is released under the PRISM Research License.

Acknowledgments

Based on Step 3.5 Flash by StepFun AI. See the technical report and blog post for more details on the base model.

Author: Ex0bit

Likes: 2

Downloads: 0

Tags: transformers, gguf, stepfun, prism, moe, reasoning, coding, agentic, abliterated, text-generation, en, zh, base_model:stepfun-ai/Step-3.5-Flash, base_model:finetune:stepfun-ai/Step-3.5-Flash, license:other, endpoints_compatible, region:us, imatrix, conversational

stepfun-ai/Step-3.5-Flash-Int8


license: apache-2.0
base_model:
  • stepfun-ai/step-3.5-flash
library_name: transformers

Step 3.5 Flash


1. Introduction

Step 3.5 Flash (visit website) is our most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token. This "intelligence density" allows it to rival the reasoning depth of top-tier proprietary models, while maintaining the agility required for real-time interaction.

2. Key Capabilities

  • Deep Reasoning at Speed: While chatbots are built for reading, agents must reason fast. Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves a generation throughput of 100–300 tok/s in typical usage (peaking at 350 tok/s for single-stream coding tasks). This allows for complex, multi-step reasoning chains with immediate responsiveness.

  • A Robust Engine for Coding & Agents: Step 3.5 Flash is purpose-built for agentic tasks, integrating a scalable RL framework that drives consistent self-improvement. It achieves 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability.

  • Efficient Long Context: The model supports a cost-efficient 256K context window by employing a 3:1 Sliding Window Attention (SWA) ratio—integrating three SWA layers for every full-attention layer. This hybrid approach ensures consistent performance across massive datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.

  • Accessible Local Deployment: Optimized for accessibility, Step 3.5 Flash brings elite-level intelligence to local environments. It runs securely on high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark), ensuring data privacy without sacrificing performance.

3. Performance

Step 3.5 Flash delivers performance parity with leading closed-source systems while remaining open and efficient.

Figure: Performance of Step 3.5 Flash measured across Reasoning, Coding, and Agentic Abilities. Open-source models (left) are sorted by their total parameter count, while top-tier proprietary models are shown on the right. xbench-DeepSearch scores are sourced from official publications for consistency. The shadowed bars represent the enhanced performance of Step 3.5 Flash using Parallel Thinking.

Detailed Benchmarks

| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
| --- | --- | --- | --- | --- | --- | --- |
| # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
| # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
| Est. decoding cost @ 128K context, Hopper GPU** | 1.0x; 100 tok/s, MTP-3, EP8 | 6.0x; 33 tok/s, MTP-1, EP32 | 18.9x; 33 tok/s, no MTP, EP32 | 18.9x; 100 tok/s, MTP-3, EP8 | 3.9x; 100 tok/s, MTP-3, EP8 | 1.2x; 100 tok/s, MTP-3, EP8 |
| Agent | | | | | | |
| τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
| BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
| BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
| BrowseComp-ZH (w/ Context Manager) | 73.7 | — | —/— | — | — | — |
| GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
| xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
| xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
| ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
| Reasoning | | | | | | |
| AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
| HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
| HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
| IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
| Coding | | | | | | |
| LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | — | 80.6 (81.6*) |
| SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
| Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |

Notes:

  1. "—" indicates the score is not publicly available or not tested.
  2. "*" indicates the original score was inaccessible or lower than our reproduced, so we report the evaluation under the same test conditions as Step 3.5 Flash to ensure fair comparability.
  3. BrowseComp (with Context Manager): When the effective context length exceeds a predefined threshold, the agent resets the context and restarts the agent loop. By contrast, Kimi K2.5 and DeepSeek-V3.2 used a "discard-all" strategy.
  4. Decoding Cost: Estimates are based on a methodology similar to, but more accurate than, the approach described in arxiv.org/abs/2507.19427.

4. Architecture Details

Step 3.5 Flash is built on a Sparse Mixture-of-Experts (MoE) transformer architecture, optimized for high throughput and low VRAM usage during inference.

4.1 Technical Specifications

| Component | Specification |
| :--- | :--- |
| Backbone | 45-layer Transformer (4,096 hidden dim) |
| Context Window | 256K |
| Vocabulary | 128,896 tokens |
| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
| Active Parameters | ~11B (per token generation) |

4.2 Mixture of Experts (MoE) Routing

Unlike traditional dense models, Step 3.5 Flash uses a fine-grained routing strategy to maximize efficiency:

  • Fine-Grained Experts: 288 routed experts per layer + 1 shared expert (always active).
  • Sparse Activation: Only the Top-8 experts are selected per token.
  • Result: The model retains the "memory" of a 196B parameter model but executes with the speed of an 11B model.
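A rough sketch of this routing pattern is shown below: a gate scores the 288 routed experts, the top-8 are combined per token, and the shared expert is always added. Gate normalization, load balancing, and batching details are omitted, so treat this as an illustration rather than the model's implementation.

# Hedged sketch of top-8-of-288 routing with one always-active shared expert.
# `gate` is assumed to be a Linear(hidden -> 288); load balancing is omitted.
import torch

def moe_forward(x, gate, routed_experts, shared_expert, top_k=8):
    scores = torch.softmax(gate(x), dim=-1)          # (tokens, 288) routing weights
    top_w, top_idx = scores.topk(top_k, dim=-1)      # keep the 8 best experts per token
    out = shared_expert(x).clone()                   # shared expert is always active
    for t in range(x.size(0)):
        for w, i in zip(top_w[t], top_idx[t]):
            out[t] = out[t] + w * routed_experts[int(i)](x[t])   # sparse: 8 of 288
    return out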

4.3 Multi-Token Prediction (MTP)

To improve inference speed, we utilize a specialized MTP Head consisting of a sliding-window attention mechanism and a dense Feed-Forward Network (FFN). This module predicts 4 tokens simultaneously in a single forward pass, significantly accelerating inference without degrading quality.
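One way to picture the speed-up is the usual draft-and-verify loop: the MTP head proposes several tokens, the main model checks them in one pass, and the longest agreeing prefix is accepted. This is a schematic of the mechanism, not StepFun's actual head or acceptance rule.

# Hedged sketch of consuming a 4-token draft: accept the longest prefix that
# the main model agrees with, falling back to the corrected token otherwise.
def accept_draft(draft_tokens, verify_tokens):
    # draft_tokens: tokens proposed by the MTP head (up to 4 per pass)
    # verify_tokens: tokens the main model would pick at the same positions
    accepted = []
    for d, v in zip(draft_tokens, verify_tokens):
        if d != v:
            accepted.append(v)      # take the corrected token, then stop
            break
        accepted.append(d)
    return accepted                  # between 1 and 4 tokens per forward pass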

5. Quick Start

You can get started with Step 3.5 Flash in minutes using Cloud API via our supported providers.

5.1 Get Your API Key.

Sign up at OpenRouter or platform.stepfun.ai, and grab your API key.

OpenRouter now offers a free trial for Step 3.5 Flash.

| Provider | Website | Base URL |
| :--- | :--- | :--- |
| OpenRouter | https://openrouter.ai | https://openrouter.ai/api/v1 |
| StepFun | https://platform.stepfun.ai | https://api.stepfun.ai/v1 |

5.2 Setup

Install the standard OpenAI SDK (compatible with both platforms).

pip install --upgrade "openai>=1.0"

Note: OpenRouter supports multiple SDKs. Learn more here.

5.3 Implementation Example

This example shows starting a chat with Step 3.5 Flash.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.stepfun.ai/v1", # or "https://openrouter.ai/api/v1"
    # Optional: OpenRouter headers for app rankings
    default_headers={
        "HTTP-Referer": "<YOUR_SITE_URL>", 
        "X-Title": "<YOUR_SITE_NAME>",
    }
)

completion = client.chat.completions.create(
    model="step-3.5-flash", # Use "stepfun/step-3.5-flash" for OpenRouter
    messages=[
        {
            "role": "system",
            "content": "You are an AI chat assistant provided by StepFun. You are good at Chinese, English, and many other languages.",
        },
        {
            "role": "user",
            "content": "Introduce StepFun's artificial intelligence capabilities."
        },
    ],
)

print(completion.choices[0].message.content)

6. Local Deployment

Step 3.5 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.

6.1 vLLM

We recommend using the latest nightly build of vLLM.

  1. Install vLLM.
# via Docker
docker pull vllm/vllm-openai:nightly

# or via pip (nightly wheels)
pip install -U vllm --pre \
  --index-url https://pypi.org/simple \
  --extra-index-url https://wheels.vllm.ai/nightly
  2. Launch the server.

Note: Full MTP3 support is not yet available in vLLM. We are actively working on a Pull Request to integrate this feature, which is expected to significantly enhance decoding performance.

  • For fp8 model
vllm serve <MODEL_PATH_OR_HF_ID> \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
  --trust-remote-code \
  --quantization fp8
  • For bf16 model
vllm serve <MODEL_PATH_OR_HF_ID> \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
  --trust-remote-code    

You can also refer to the Step-3.5-Flash recipe.

6.2 SGLang

  1. Install SGLang.
# via Docker
docker pull lmsysorg/sglang:dev-pr-18084
# or from source (pip)
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
  2. Launch the server.
  • For bf16 model
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
  --served-model-name step3p5-flash \
  --tp-size 8 \
  --tool-call-parser step3p5 \
  --reasoning-parser step3p5 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --host 0.0.0.0 \
  --port 8000
  • For fp8 model
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
  --served-model-name step3p5-flash \
  --tp-size 8 \
  --ep-size 8 \
  --tool-call-parser step3p5 \
  --reasoning-parser step3p5 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --host 0.0.0.0 \
  --port 8000

6.3 Transformers (Debug / Verification)

Use this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "<MODEL_PATH_OR_HF_ID>"

# 1. Setup
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# 2. Prepare Input
messages = [{"role": "user", "content": "Explain the significance of the number 42."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# 3. Generate
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(output_text)

6.4 llama.cpp

System Requirements

  • GGUF Model Weights (int4): 111.5 GB
  • Runtime Overhead: ~7 GB
  • Minimum VRAM: 120 GB (e.g., Mac Studio, DGX-Spark, AMD Ryzen AI Max+ 395)
  • Recommended: 128 GB unified memory

Steps

  1. Use official llama.cpp:

Note: the folder Step-3.5-Flash/tree/main/llama.cpp is obsolete.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
  2. Build llama.cpp on Mac:
cmake -S . -B build-macos \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_ACCELERATE=ON \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DLLAMA_BUILD_COMMON=ON \
  -DGGML_LTO=ON
cmake --build build-macos -j8
  3. Build llama.cpp on DGX-Spark:
cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DLLAMA_CURL=OFF \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DLLAMA_BUILD_COMMON=ON
cmake --build build-cuda -j8
  4. Build llama.cpp on AMD Windows:
cmake -S . -B build-vulkan \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF \
  -DGGML_OPENMP=ON \
  -DGGML_VULKAN=ON
cmake --build build-vulkan -j8
  5. Run with llama-cli:
./llama-cli -m step3.5_flash_Q4_K_S.gguf -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
  6. Test performance with llama-batched-bench:
./llama-batched-bench -m step3.5_flash_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1

7. Using Step 3.5 Flash on Agent Platforms

7.1 Claude Code & Codex

It's straightforward to add Step 3.5 Flash to the list of models in most coding environments. See below for the instructions for configuring Claude Code and Codex to use Step 3.5 Flash.

7.1.1 Prerequisites

Sign up at StepFun.ai or OpenRouter and grab an API key, as mentioned in the Quick Start.

7.1.2 Environment setup

Claude Code and Codex rely on Node.js. We recommend installing Node.js version > v20. You can install Node via nvm.

Mac/Linux:

# Install nvm on Mac/Linux via curl:
# Step 1
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash

# Copy the full command
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"

# Users in China can set up npm mirror
npm config set registry https://registry.npmmirror.com

# Step 2
nvm install v22

# Make sure Node.js is installed
node --version

npm --version

Windows: You can download the installation file (nvm-setup.exe) from https://github.com/coreybutler/nvm-windows/releases. Follow the instructions to install nvm. Run nvm commands to make sure it is installed.

7.1.3 Use Step 3.5 Flash on Claude Code

  1. Install Claude Code.
# install claude code via npm
npm install -g @anthropic-ai/claude-code

# test if the installation is successful
claude --version 
  2. Configure Claude Code.

To accommodate diverse workflows in Claude Code, we support both Anthropic-style and OpenAI-style APIs.

Option A: Anthropic API style:

If you intend to use the OpenRouter API, refer to the OpenRouter integration guide.

Step 1: Edit Claude Settings. Update ~/.claude/settings.json.

You only need to modify the fields shown below. Leave the rest of the file unchanged.

{
"env": {
  "ANTHROPIC_API_KEY": "API_KEY_from_StepFun",
  "ANTHROPIC_BASE_URL": "https://api.stepfun.ai/"
},
"model": "step-3.5-flash"
}

Step 2: Start Claude Code.

Save the file, and then start Claude Code. Run /status to confirm the model and base URL.

❯ /status
─────────────────────────────────────────────────────────────────────────────────
Settings:  Status   Config   Usage  (←/→ or tab to cycle)

Version: 2.1.1
Session name: /rename to add a name
Session ID: 676dae61-259d-4eef-8c2f-0f1641600553
cwd: /Users/step-test/
Auth token: none
API key: ANTHROPIC_API_KEY
Anthropic base URL: https://api.stepfun.ai/

Model: step-3.5-flash
Setting sources: User settings

Option B: OpenAI API style

Note: OpenAI API style here refers to the chat/completions/ format.

We recommend using claude-code-router. For details, see https://github.com/musistudio/claude-code-router.

After Claude Code is installed, install claude-code-router:

# install ccr via npm
npm install -g @musistudio/claude-code-router

# validate it is installed
ccr -v

Add the following configurations to ~/.claude-code-router/config.json.

{
"PORT": 3456,
"Providers": [
  {
    "name": "stepfun-api",
    "api_base_url": "https://api.stepfun.com/v1/chat/completions",
    "api_key": "StepFun_API_KEY",
    "models": ["step-3.5-flash"],
    "transformer":{
         "step-3.5-flash": { "use": ["OpenAI"]}
    }
  }
],
"Router": {
  "default": "stepfun-api,step-3.5-flash",
  "background": "stepfun-api,step-3.5-flash",
  "think": "stepfun-api,step-3.5-flash",
  "longContext": "stepfun-api,step-3.5-flash",
  "webSearch": "stepfun-api,step-3.5-flash"
}
}
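
Before wiring the key into claude-code-router, you can sanity-check it against the chat/completions endpoint configured above. A minimal sketch using the same URL and model name, assuming standard Bearer authentication:

curl https://api.stepfun.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer StepFun_API_KEY" \
  -d '{
    "model": "step-3.5-flash",
    "messages": [{"role": "user", "content": "What is your name?"}]
  }'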

You can now start Claude Code:

# Start Claude
ccr code 

# restart ccr if configs are changed
ccr restart 

7.1.4 Use Step 3.5 Flash on Codex

  1. Install Codex
# Install codex via npm
npm install -g @openai/codex

# Test if it is installed
codex --version
  2. Configure Codex. Add the following settings to ~/.codex/config.toml, keeping the rest of the settings as they are.
model="step-3.5-flash"
model_provider = "stepfun-chat"
preferred_auth_method = "apikey"

# configure the provider
[model_providers.stepfun-chat]
name = "OpenAI using response"
base_url = "https://api.stepfun.com/v1"
env_key = "OPENAI_API_KEY"
wire_api = "chat"
query_params = {}

For this provider, Codex's wire_api only supports chat; if your existing configuration uses the responses mode, change it to chat. Also make sure model_provider points to the newly configured stepfun-chat.
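
Because env_key is set to OPENAI_API_KEY, Codex reads the StepFun key from that environment variable, for example:

# Point OPENAI_API_KEY at your StepFun API key before launching Codex
export OPENAI_API_KEY="StepFun_API_KEY"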

After finishing the configuration, run codex in a new terminal window to start Codex. Run /status to check the configuration.

/status
📂 Workspace
  • Path: /Users/step-test/
  • Approval Mode: on-request
  • Sandbox: workspace-write
  • AGENTS files: (none)

🧠 Model
  • Name: step-3.5-flash
  • Provider: Stepfun-chat

💻 Client
  • CLI Version: 0.40.0

7.1.5 Use Step 3.5 Flash on Step-DeepResearch (DeepResearch)

  1. Follow the reference environment setup linked below and set MODEL_NAME to Step-3.5-Flash: https://github.com/stepfun-ai/StepDeepResearch?tab=readme-ov-file#1-environment-setup
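
For reference, the model selection typically ends up in the project's environment configuration; everything here other than the MODEL_NAME value is an assumption, so follow the linked README for the exact setup:

# Hypothetical .env entry; see the StepDeepResearch README for the exact variable names
MODEL_NAME=Step-3.5-Flash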

8. Known Issues and Future Directions

  1. Token Efficiency. Step 3.5 Flash achieves frontier-level agentic intelligence but currently relies on longer generation trajectories than Gemini 3.0 Pro to reach comparable quality.
  2. Efficient Universal Mastery. We aim to unify generalist versatility with deep domain expertise. To achieve this efficiently, we are advancing variants of on-policy distillation, allowing the model to internalize expert behaviors with higher sample efficiency.
  3. RL for More Agentic Tasks. While Step 3.5 Flash demonstrates competitive performance on academic agentic benchmarks, the next frontier of agentic AI necessitates the application of RL to intricate, expert-level tasks found in professional work, engineering, and research.
  4. Operational Scope and Constraints. Step 3.5 Flash is tailored for coding and work-centric tasks, but may experience reduced stability during distribution shifts. This typically occurs in highly specialized domains or long-horizon, multi-turn dialogues, where the model may exhibit repetitive reasoning, mixed-language outputs, or inconsistencies in time and identity awareness.

9. Co-Developing the Future

We view our roadmap as a living document, evolving continuously based on real-world usage and developer feedback. As we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop—your insights directly influence our priorities.

  • Join the Conversation: Our Discord community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀
  • Report Friction: Encountering limitations? You can open an issue on GitHub or flag it directly in our Discord support channels.

License

This project is open-sourced under the Apache 2.0 License.

Author: stepfun-ai

Likes: 2

Downloads: 0

Tags: transformers, gguf, arxiv:2601.05593, arxiv:2507.19427, license:apache-2.0, endpoints_compatible, region:us, conversational

gss1147/Gemma-3-Prompt-Coder-270m-it-Uncensored


base_model:

  • google/gemma-3-270m-it
  • huihui-ai/Huihui-gemma-3-270m-it-abliterated
  • AxionLab-official/DogeAI-v1.5-Coder
  • gokaygokay/prompt-enhancer-gemma-3-270m-it
  • broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726

library_name: transformers

tags:

  • mergekit
  • merge

datasets:

  • microsoft/rStar-Coder
  • gokaygokay/prompt-enhancement-75k
  • gokaygokay/prompt-enhancer-dataset

Gemma-3-Prompt-Coder-270m-it (Uncensored)

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the SLERP merge method.

Models Merged

The following models were included in the merge:

  • huihui-ai-Huihui-gemma-3-270m-it-abliterated
  • AxionLab-official-DogeAI-v1.5-Coder
  • gokaygokay-prompt-enhancer-gemma-3-270m-it
  • broadfield-dev-gemma-3-270m-tuned-0106-1726
  1. This is a fine-tuned model based on google/gemma-3-270m-it for enhancing and expanding short prompts into detailed, context-rich descriptions.
  2. This is an uncensored version of google/gemma-3-270m-it, achieved through fine-tuning with the TRL framework.
  3. This model is a fine-tuned version of google/gemma-3-270m-it on the microsoft/rStar-Coder dataset.

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.
  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Author: gss1147

Likes: 2

Downloads: 1

Tags: transformers, safetensors, gemma3_text, text-generation, mergekit, merge, dataset:microsoft/rStar-Coder, dataset:gokaygokay/prompt-enhancement-75k, dataset:gokaygokay/prompt-enhancer-dataset, base_model:AxionLab-official/DogeAI-v1.5-Coder, base_model:merge:AxionLab-official/DogeAI-v1.5-Coder, base_model:broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726, base_model:merge:broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726, base_model:gokaygokay/prompt-enhancer-gemma-3-270m-it, base_model:merge:gokaygokay/prompt-enhancer-gemma-3-270m-it, base_model:google/gemma-3-270m-it, base_model:merge:google/gemma-3-270m-it, base_model:huihui-ai/Huihui-gemma-3-270m-it-abliterated, base_model:merge:huihui-ai/Huihui-gemma-3-270m-it-abliterated, text-generation-inference, endpoints_compatible, region:us

Shravani-Limited/VideoAvatar-UK-Voice-Engine


license: apache-2.0

language:

  • en

tags:

  • text-to-speech
  • voice-cloning
  • f5-tts
  • regional-accents
  • uk

VideoAvatar.ai: UK Regional Voice Engine (v1)

Developed by Shravani Limited, this model is a state-of-the-art Zero-Shot Voice Cloning engine specifically fine-tuned to master the diverse regional accents of the United Kingdom.

🌟 Capabilities

  • Zero-Shot Cloning: Clone any voice perfectly with just 3-10 seconds of reference audio. Master your own identity in seconds.
  • UK Regional Mastery: Optimized for high-fidelity output in:
    • Northern: Manchester, Scouse, Geordie.
    • Southern: London (Estuary & MLE).
    • Celtic: Scottish (Fife/Edinburgh), Irish (Dublin/Belfast), Welsh (Cardiff).
  • Commercial Ready: 100% trained on ethically sourced, commercially cleared datasets (CC-BY 4.0).

🛠️ Technical Details

  • Architecture: F5-TTS (Diffusion-based Transformer).
  • Updates: 174,000 steps on 4x NVIDIA A10G GPUs.
  • Precision: EMA (Exponential Moving Average) weights for stable, human-quality prosody.

🚀 Usage

This model is designed for use with the F5-TTS inference pipeline.
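
A minimal sketch of how such a checkpoint is typically driven through the F5-TTS command-line inference tool; the model name, checkpoint path, and reference clip below are placeholders, and flag names may differ across F5-TTS releases:

# Hypothetical paths; flags follow the public F5-TTS CLI
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ckpt_file path/to/uk_voice_engine_ema.safetensors \
  --ref_audio path/to/reference_clip.wav \
  --ref_text "Transcript of the reference clip." \
  --gen_text "Text to synthesize in the cloned voice."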

⚖️ License & Ethics

This model is released under the Apache 2.0 license. Shravani Limited is committed to ethical AI—please ensure you have the rights to any voice you attempt to clone.

Author: Shravani-Limited

Likes: 1

Downloads: 0

Tags: f5-tts, text-to-speech, voice-cloning, regional-accents, uk, en, license:apache-2.0, region:us

Guilherme34/nanochat-d32-retrained-hf

Author: Guilherme34

Likes: 1

Downloads: 0

Tags: safetensors, nanochat, region:us