Today's AI Summary

AI Developments: Expressive TTS, Image-to-LoRA, and Agentic Coding Models Emerge

Here's a look at some of the most interesting AI model and research paper releases from today, focusing on advancements in text-to-speech, image generation, and agentic coding.

Research Highlights

Several interesting research papers have been published, including:

  • Astra: General Interactive World Model with Autoregressive Denoising: Introduces a world model capable of generating real-world futures for diverse scenarios with precise action interactions. It uses an autoregressive denoising architecture and temporal causal attention.
  • Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs: This paper introduces benchmarks to evaluate cross-modal inconsistency in multimodal large language models (MLLMs), finding that MLLMs struggle to consistently reason across different modalities (image, text, mixed).
  • Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training: This paper challenges the view that predicting downstream task performance is unreliable, proposing a direct framework to model the scaling of benchmark performance from the training budget.
  • Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders: Introduces RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations.

Model Releases

  • zai-org/GLM-TTS: This text-to-speech (TTS) model stands out with 50 likes. GLM-TTS is a high-quality system based on large language models, supporting zero-shot voice cloning and streaming inference. It uses a Multi-Reward Reinforcement Learning framework to improve the expressiveness of generated speech and achieves a low Character Error Rate (CER).
  • DiffSynth-Studio/Qwen-Image-i2L: This model introduces a novel approach: taking an image as input and outputting a LoRA model trained on that image. It includes models for style transfer and detail preservation.
  • unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF: This model has 19 likes and 3191 downloads. Devstral is an agentic LLM for software engineering tasks. Devstral Small 2 excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-bench.

Key Takeaways

  • Advancements in TTS: GLM-TTS showcases the potential of reinforcement learning to create more expressive and controllable TTS systems.
  • Innovative Image Generation: Qwen-Image-i2L presents a unique method for generating LoRA models from images, opening new possibilities for style transfer and content creation.
  • Agentic Coding Models: Devstral-Small-2-24B-Instruct-2512-GGUF is designed to excel at agentic coding tasks, making it a great choice for software engineering agents.
  • Focus on Faithfulness: The RAGLens paper addresses a critical challenge in RAG systems: ensuring the faithfulness of generated content to the retrieved sources.

AI Papers for 2026-02-09

Shared LoRA Subspaces for almost Strict Continual Learning

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration without relying on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual fine-tuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
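
As a generic illustration of descriptor-based semantic matching (not DyTopo's actual implementation; the similarity threshold and the random toy embeddings below are placeholders), routing can be sketched as: embed each agent's need and offer descriptors with any text-embedding model, then keep only the directed edges whose cosine similarity clears a threshold.

import numpy as np

def route_edges(need_vecs, offer_vecs, threshold=0.3):
    """Build a sparse directed graph: add edge j -> i when agent j's offer matches agent i's need.

    need_vecs, offer_vecs: arrays of shape (n_agents, dim), one descriptor embedding
    per agent (any sentence-embedding model could produce these).
    """
    need = need_vecs / np.linalg.norm(need_vecs, axis=1, keepdims=True)
    offer = offer_vecs / np.linalg.norm(offer_vecs, axis=1, keepdims=True)
    sim = need @ offer.T                   # sim[i, j] = cos(need_i, offer_j)
    np.fill_diagonal(sim, -np.inf)         # no self-edges
    return [(j, i) for i in range(sim.shape[0]) for j in range(sim.shape[1]) if sim[i, j] >= threshold]

# Toy usage with random vectors standing in for real descriptor embeddings
rng = np.random.default_rng(0)
edges = route_edges(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(edges)   # directed (sender, receiver) edges along which private messages are routed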

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
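
For readers unfamiliar with the calibration tool involved, here is a minimal split-conformal-prediction sketch (a textbook version, not CommCP's message-calibration procedure; the confidence scores and candidate answers are toy values):

import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # Split conformal: use the ceil((n+1)(1-alpha))/n empirical quantile of calibration scores
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Calibration set: nonconformity score = 1 - model confidence assigned to the true answer
cal_scores = 1.0 - np.array([0.9, 0.8, 0.95, 0.6, 0.85, 0.7, 0.99, 0.75])
qhat = conformal_threshold(cal_scores, alpha=0.1)

# At test time, keep every candidate whose score stays within the calibrated threshold;
# the resulting set covers the true answer with probability at least 1 - alpha.
candidate_conf = {"red mug on table": 0.92, "blue mug on shelf": 0.55, "no mug visible": 0.12}
prediction_set = [c for c, p in candidate_conf.items() if 1.0 - p <= qhat]
print(prediction_set)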

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

Learning Event-Based Shooter Models from Virtual Reality Experiments

Virtual reality (VR) has emerged as a powerful tool for evaluating school security measures in high-risk scenarios such as school shootings, offering experimental control and high behavioral fidelity. However, assessing new interventions in VR requires recruiting new participant cohorts for each condition, making large-scale or iterative evaluation difficult. These limitations are especially restrictive when attempting to learn effective intervention strategies, which typically require many training episodes. To address this challenge, we develop a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies. We use the simulator to examine the impact of a robot-based shooter intervention strategy. Once shown to reproduce key empirical patterns, the DES enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects. Overall, this work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the K-armed Gaussian bandit and identify optimism as a key mechanism for restoring stability, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS (Halder et al., 2025) is stable for any K ≥ 2, including the challenging regime where multiple arms are optimal. This resolves the open question raised by Halder et al. (2025) by extending their results from the two-armed setting to the general K-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
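
A minimal sketch of the variance-inflation idea in a K-armed Gaussian bandit (illustrative only; the inflation factor, flat prior, and forced initial pulls are simplifications, not the paper's exact algorithm):

import numpy as np

def inflated_ts(true_means, horizon=5000, sigma=1.0, inflation=2.0, seed=0):
    """Variance-inflated Thompson sampling: posterior stds are multiplied by `inflation` (> 1),
    the optimistic tweak that helps arm pull counts concentrate for valid downstream inference."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.ones(K)                                          # one forced pull per arm
    sums = np.array([rng.normal(m, sigma) for m in true_means])
    for _ in range(horizon - K):
        post_mean = sums / counts
        post_std = inflation * sigma / np.sqrt(counts)
        arm = int(np.argmax(rng.normal(post_mean, post_std)))    # sample from inflated posterior, act greedily
        sums[arm] += rng.normal(true_means[arm], sigma)
        counts[arm] += 1
    return counts, sums / counts

counts, estimates = inflated_ts([0.0, 0.2, 0.2])                 # includes a tie among optimal arms
print(counts, estimates)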

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.

Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
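
As a rough sketch of attentive pooling over encoder frames (a generic single-head version; the paper's Multi-head Attentive Average Pooling and QKV Pooling differ in detail, and the feature dimension below assumes Whisper Tiny's 384-dim encoder):

import torch
import torch.nn as nn

class AttentiveAveragePooling(nn.Module):
    # Learn a scalar relevance score per frame, then take the weighted average,
    # collapsing (batch, frames, dim) encoder outputs to a (batch, dim) utterance vector.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                      # frames: (batch, T, dim)
        weights = torch.softmax(self.score(frames), dim=1)
        return (weights * frames).sum(dim=1)        # (batch, dim)

# Toy usage with random features standing in for Whisper encoder outputs
pool = AttentiveAveragePooling(dim=384)
utterance_vec = pool(torch.randn(2, 1500, 384))
emotion_logits = nn.Linear(384, 4)(utterance_vec)   # e.g., 4 emotion classes
print(emotion_logits.shape)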

AI Models

WithinUsAI/Gemma-3-Prompt-Coder-270m-it-Uncensored-GGUF


base_model:

  • google/gemma-3-270m-it
  • huihui-ai/Huihui-gemma-3-270m-it-abliterated
  • AxionLab-official/DogeAI-v1.5-Coder
  • gokaygokay/prompt-enhancer-gemma-3-270m-it
  • broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726

library_name: transformers

tags:

  • mergekit
  • merge

datasets:

  • microsoft/rStar-Coder
  • gokaygokay/prompt-enhancement-75k
  • gokaygokay/prompt-enhancer-dataset

Gemma-3-Prompt-Coder-270m-it (Uncensored)

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the SLERP merge method.

Models Merged

The following models were included in the merge:

  • huihui-ai/Huihui-gemma-3-270m-it-abliterated
  • AxionLab-official/DogeAI-v1.5-Coder
  • gokaygokay/prompt-enhancer-gemma-3-270m-it
  • broadfield-dev/gemma-3-270m-tuned-0106-1726

Among these constituent models:

  1. One is a fine-tuned model based on google/gemma-3-270m-it for enhancing and expanding short prompts into detailed, context-rich descriptions.
  2. One is an uncensored version of google/gemma-3-270m-it, achieved through fine-tuning with the TRL framework.
  3. One is a fine-tuned version of google/gemma-3-270m-it on the microsoft/rStar-Coder dataset.

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model's safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
  • Not Suitable for All Audiences: Due to limited content filtering, the model's outputs may be inappropriate for public settings, underage users, or applications requiring high security.
  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Author: WithinUsAI

Likes: 2

Downloads: 0

Tags: transformers, gguf, mergekit, merge, dataset:microsoft/rStar-Coder, dataset:gokaygokay/prompt-enhancement-75k, dataset:gokaygokay/prompt-enhancer-dataset, base_model:AxionLab-official/DogeAI-v1.5-Coder, base_model:merge:AxionLab-official/DogeAI-v1.5-Coder, base_model:broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726, base_model:merge:broadfield-dev/gemma-3-270m-tuned-0106-1020-tuned-0106-1726, base_model:gokaygokay/prompt-enhancer-gemma-3-270m-it, base_model:merge:gokaygokay/prompt-enhancer-gemma-3-270m-it, base_model:google/gemma-3-270m-it, base_model:merge:google/gemma-3-270m-it, base_model:huihui-ai/Huihui-gemma-3-270m-it-abliterated, base_model:merge:huihui-ai/Huihui-gemma-3-270m-it-abliterated, endpoints_compatible, region:us

ParamTatva/sanskrit-ppo-hopper-v5


license: other
license_name: paramtatva-commercial
license_link: LICENSE
language:
  • sa
tags:
  • reinforcement-learning
  • ppo
  • mujoco
  • hopper
  • gymnasium
  • robotics
  • sanskrit
  • eval_results
datasets:
  • custom
metrics:
  • reward

model-index (name: sanskrit-ppo-hopper-v5), reported mean_reward on Gymnasium continuous-control tasks:
  • Hopper-v5: 3183.2 (Best Return, seed=3, 2M steps)
  • Walker2d-v5: 4918.5 (Best Return, seed=42, 3M steps)
  • HalfCheetah-v5: 5803.9 (Best Return, seed=2, 3M steps, still training)
  • Reacher-v5: -4.2 (Best Return, seeds 1,3)

<p align="center"> <img src="banner.png" alt="Sanskrit-PPO Banner" width="100%"> </p>

🕉️ Sanskrit-PPO: Hopper-v5 SOTA

2979.5 peak reward on Hopper-v5 — 125% of the CleanRL benchmark (2382 ± 271).

This is the base PPO policy from the SanskritLM project — a research initiative by ParamTatva.org exploring how Sanskrit linguistic embeddings can drive robotic control. This release establishes the SOTA baseline; the Sanskrit-conditioned multi-task policy (which accepts behavioral commands in Sanskrit) is coming in a future release.

What's in This Release

| File | Description |
|---|---|
| hopper_v5_sota.pt | Trained PPO weights (135 KB), 125% of CleanRL SOTA |
| model.py | Agent architecture (2-layer MLP, ~10K params) |
| train.py | Fully reproducible training script |
| inference.py | CLI inference tool with --render support |

Results

| Metric | Value |
|---|---|
| Peak avg return (last 20) | 2979.5 |
| Best checkpoint return | 2731.3 |
| CleanRL benchmark | 2382 ± 271 |
| Our ratio vs SOTA | 125% |
| Training time | ~1 hour (single GPU) |
| Steps to SOTA | ~300K of 1M |

Training Curve

Update  50  | 102K steps | Return:  716
Update  80  | 163K steps | Return: 2023
Update 150  | 307K steps | Return: 2296
Update 230  | 471K steps | Return: 2620
Update 375  | 768K steps | Return: 2731
PEAK                     | Return: 2979  ← 125% of CleanRL SOTA

Architecture

| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Actor | MLP: 11 → 64 → 64 → 3, Tanh activations |
| Critic | MLP: 11 → 64 → 64 → 1, Tanh activations |
| Initialization | Orthogonal (√2 hidden, 0.01 actor, 1.0 critic) |
| Parameters | ~10K |
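
For reference, here is a minimal sketch of an actor-critic module matching the table above. It follows the standard CleanRL-style continuous-control PPO agent; the released model.py may differ in details, and the Agent class name and get_action_and_value signature are assumed from the inference snippet further below.

import torch
import torch.nn as nn
from torch.distributions import Normal

def layer_init(layer, std=2 ** 0.5, bias_const=0.0):
    # Orthogonal initialization with the gains listed in the table above
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim=11, act_dim=3):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, act_dim), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action_and_value(self, obs, action=None):
        mean = self.actor_mean(obs)
        std = self.actor_logstd.expand_as(mean).exp()
        dist = Normal(mean, std)
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1), self.critic(obs)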

The Key Insight: Environment Wrappers

The breakthrough was using Gymnasium's built-in normalization wrappers. Without them, PPO plateaus at ~474. With them: 2979.

import gymnasium as gym

env = gym.make("Hopper-v5")
env = gym.wrappers.RecordEpisodeStatistics(env)      # track episodic returns and lengths
env = gym.wrappers.FlattenObservation(env)
env = gym.wrappers.NormalizeObservation(env)         # running mean/std normalization of observations
env = gym.wrappers.NormalizeReward(env, gamma=0.99)  # normalize discounted returns

Full investigation: The 371 Wall — A Detective Story

Quick Start

Inference

import torch
import gymnasium as gym
from model import Agent

agent = Agent(obs_dim=11, act_dim=3)
ckpt = torch.load("hopper_v5_sota.pt", map_location="cpu")
agent.load_state_dict(ckpt["model_state_dict"])
agent.eval()

env = gym.make("Hopper-v5", render_mode="human")
obs, _ = env.reset()
total_reward = 0

for _ in range(1000):
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(
            torch.FloatTensor(obs).unsqueeze(0)
        )
    obs, reward, term, trunc, _ = env.step(action.numpy().flatten())
    total_reward += reward
    if term or trunc: break

print(f"Episode reward: {total_reward:.0f}")

Train from Scratch

pip install gymnasium[mujoco] torch numpy
python train.py

~1 hour on a single GPU (T4/A100), ~4 hours on CPU.

🕉️ Coming Soon: Sanskrit-Conditioned Multi-Task Policy

This release is the base PPO agent — it takes raw observations and produces actions without any language conditioning.

The full SanskritLM pipeline adds a proprietary encoder that accepts behavioral commands in Sanskrit (Devanagari script) and conditions the policy via Feature-wise Linear Modulation (FiLM). A single policy learns multiple behaviors from Sanskrit commands:

| Sanskrit Command | Transliteration | Meaning | Behavior |
|---|---|---|---|
| अग्रे गच्छ | agre gaccha | "go forward" | Forward locomotion |
| पृष्ठतः गच्छ | pṛṣṭhataḥ gaccha | "go backward" | Backward locomotion |
| ऊर्ध्वं कूर्द | ūrdhvaṃ kūrda | "jump up" | Hopping/jumping |
| तिष्ठ | tiṣṭha | "stand still" | Stationary balance |
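
The Sanskrit-conditioned policy itself is not part of this release, but the FiLM mechanism mentioned above can be illustrated with a generic sketch (all class names, layer sizes, and the command-embedding dimension here are hypothetical and do not reflect the proprietary SanskritLM implementation):

import torch
import torch.nn as nn

class FiLMPolicy(nn.Module):
    # Hypothetical sketch: a command embedding produces per-layer scale/shift
    # vectors (gamma, beta) that modulate the policy's hidden features.
    def __init__(self, obs_dim=11, act_dim=3, cmd_dim=128, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.film = nn.Linear(cmd_dim, 2 * hidden)    # -> gamma and beta
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, cmd_emb):
        h = self.body(obs)
        gamma, beta = self.film(cmd_emb).chunk(2, dim=-1)
        h = gamma * h + beta                          # feature-wise linear modulation
        return self.head(torch.tanh(h))

The appeal of this pattern is that one policy body is reused for every behavior; the command embedding only rescales and shifts hidden features, which is what lets a single network express multiple command-conditioned behaviors.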

Why Sanskrit? Sanskrit's compositional morphology (sandhi, vibhakti, dhātu system) produces inherently structured embeddings. A single verb root (dhātu) encodes motion type, direction, intensity, and aspect — information that requires multiple English words. This linguistic density gives the encoder a natural advantage for encoding complex behavioral commands.

The multi-task release will include:

  • Sanskrit-conditioned policy weights for Hopper, HalfCheetah, Walker2d, Humanoid, Ant, and Reacher
  • The encoder interface (commands must be in Sanskrit — use an LLM or translation API to generate Devanagari input)
  • Multi-environment benchmark results

🔔 Watch this repo for the multi-task release, or visit ParamTatva.org for updates.

Hyperparameters

| Parameter | Value |
|---|---|
| Total timesteps | 1,000,000 |
| Learning rate | 3e-4 (linear anneal) |
| Rollout steps | 2,048 |
| Minibatch size | 64 |
| Update epochs | 10 |
| Gamma | 0.99 |
| GAE Lambda | 0.95 |
| Clip coefficient | 0.2 |
| Value function clipping | ✓ |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Max gradient norm | 0.5 |
| Seed | 1 |

License

Released under the ParamTatva Commercial License. See LICENSE.

  • ✅ Academic research and evaluation
  • ✅ Personal/educational use
  • ❌ Commercial deployment requires a separate license
  • ❌ Redistribution of weights without attribution

Citation

@misc{paramtatva2026ppohopper,
  title={Sanskrit-PPO: SOTA Reinforcement Learning with Linguistic Embeddings},
  author={ParamTatva Research},
  year={2026},
  url={https://huggingface.co/paramtatva/sanskrit-ppo-hopper-v5}
}

Contact

For commercial licensing and multi-task model access, contact the ParamTatva team at ParamTatva.org.

Author: ParamTatva

Likes: 2

Downloads: 0

Tags: reinforcement-learning, ppo, mujoco, hopper, gymnasium, robotics, sanskrit, eval_results, sa, dataset:custom, license:other, model-index, eval-results, region:us

silveroxides/Anima-Quantized


license: other
license_name: circlestone-labs-non-commercial-license
license_link: https://huggingface.co/circlestone-labs/Anima/raw/main/LICENSE.md
tags:
  • diffusion-single-file
  • comfyui
base_model:
  • circlestone-labs/Anima
base_model_relation: quantized

Quantized weights using Learned Rounding, intended for ComfyUI.

Original Model Card:

![Anima example image](https://huggingface.co/circlestone-labs/Anima/resolve/main/example.png)

Anima is a 2 billion parameter text-to-image model created via a collaboration between CircleStone Labs and Comfy Org. It is focused mainly on anime concepts, characters, and styles, but is also capable of generating a wide variety of other non-photorealistic content. The model is designed for making illustrations and artistic images, and will not work well at realism.

It is trained on several million anime images and about 800k non-anime artistic images. No synthetic data was used for training. The knowledge cut-off date for the anime training data is September 2025.

This preview version is an intermediate model checkpoint. The model is still training and the final version will improve, especially for fine details and overall aesthetics.

Installing and running

The model is natively supported in ComfyUI. The above image contains a workflow; you can open it in ComfyUI or drag-and-drop to get the workflow. The model files go in their respective folders inside your model directory:

  • anima-preview.safetensors goes in ComfyUI/models/diffusion_models
  • qwen_3_06b_base.safetensors goes in ComfyUI/models/text_encoders
  • qwen_image_vae.safetensors goes in ComfyUI/models/vae (this is the Qwen-Image VAE, you might already have it)

Generation settings

  • The preview version should be used at about 1MP resolution. E.g. 1024x1024, 896x1152, 1152x896, etc.
  • 30-50 steps, CFG 4-5.
  • A variety of samplers work. Some of my favorites:
    • er_sde: neutral style, flat colors, sharp lines. I use this as a reasonable default.
    • euler_a: Softer, thinner lines. Can sometimes tend towards a 2.5D look. CFG can be pushed a bit higher than other samplers without burning the image.
    • dpmpp_2m_sde_gpu: similar in style to er_sde but can produce more variety and be more "creative". Depending on the prompt it can get too wild sometimes.

Prompting

The model is trained on Danbooru-style tags, natural language captions, and combinations of tags and captions.

Tag order

[quality/meta/year/safety tags] [1girl/1boy/1other etc] [character] [series] [artist] [general tags]

Within each tag section, the tags can be in arbitrary order.

Quality tags

Human score based: masterpiece, best quality, good quality, normal quality, low quality, worst quality

PonyV7 aesthetic model based: score_9, score_8, ..., score_1

You can use either the human score quality tags, the aesthetic model tags, both together, or neither. All combinations work.

Time period tags

Specific year: year 2025, year 2024, ...

Period: newest, recent, mid, early, old

Meta tags

highres, absurdres, anime screenshot, jpeg artifacts, official art, etc

Safety tags

safe, sensitive, nsfw, explicit

Artist tags

Prefix artist with @. E.g. "@big chungus". You must put @ in front of the artist. The effect will be very weak if you don't.

Full tag example

year 2025, newest, normal quality, score_5, highres, safe, 1girl, oomuro sakurako, yuru yuri, @nnn yryr, smile, brown hair, hat, solo, fur-trimmed gloves, open mouth, long hair, gift box, fang, skirt, red gloves, blunt bangs, gloves, one eye closed, shirt, brown eyes, santa costume, red hat, skin fang, twitter username, white background, holding bag, fur trim, simple background, brown skirt, bag, gift bag, looking at viewer, santa hat, ;d, red shirt, box, gift, fur-trimmed headwear, holding, red capelet, holding box, capelet

Tag dropout

The model was trained with random tag dropout. You don't need to include every single relevant tag for the image.

Dataset tags

To improve style and content diversity, the model was additionally trained on two non-anime datasets: LAION-POP (specifically the ye-pop version) and DeviantArt. Both were filtered to exclude photos. Because these datasets are qualitatively different from anime datasets, captions from them have been labeled with a "dataset tag". This occurs at the very beginning of a prompt followed by a newline. Optionally, the second line can contain either the image alt-text (ye-pop) or the title of the work (DeviantArt). Examples:

ye-pop
For Sale: Others by Arun Prem
Abstract, oil painting of three faceless, blue-skinned figures. Left: white, draped figure; center: yellow-shirted, dark-haired figure; right: red-veiled, dark-haired figure carrying another. Bold, textured colors, minimalist style.

deviantart
Flame
Digital painting of a fiery dragon with glowing yellow eyes, black horns, and a long, sinuous tail, perched on a glowing, molten rock formation. The background is a gradient of dark purple to orange.

Natural language prompting tips

  • If using pure natural language, more descriptive prompts are better. Aim for at least 2 sentences. Extremely short prompts can give unexpected results (this will be better in the final version).
  • You can mix tags and natural language in arbitrary order.
  • You can put quality / artist tags at the beginning of a natural language prompt.
    • "masterpiece, best quality, @big chungus. An anime girl with medium-length blonde hair is..."
  • Name a character, then describe their basic appearance.
    • "Digital artwork of Fern from Sousou no Frieren, with long purple hair and purple eyes, wearing a black coat over a white dress with puffy sleeves..."
    • This is extra important when prompting for multiple characters. If you just list off character names with no description of appearance, the model can get confused.

Model comparison

You may be interested in comparing Anima's outputs with other models. A ComfyUI workflow, anima_comparison.json, is provided. This workflow generates a grid of images where each model is a column and the rows are different seeds. It can be configured to compare any number of models you select by changing a few output nodes. Supported model architectures: Anima, SDXL, Lumina, Chroma, Newbie-Image. The default configuration compares Anima, NetaYume, and Newbie-Image.

Limitations

  • The model doesn't do realism well. This is intended. It is an anime / illustration / art focused model.
  • The model may generate undesired content, especially if the prompt is short or lacking details.
    • Avoid this by using the appropriate safety tags in the positive and negative prompts, and by writing sufficiently detailed prompts.
  • The model isn't great at text rendering. It can generally do single words and sometimes short phrases, but lengthy text rendering won't work well.
  • The preview model isn't that good at higher resolutions yet.
    • It is a medium-resolution intermediate checkpoint, trained on a small amount of high-res images.
    • The final version will include a dedicated high-resolution training phase; details and overall image composition will improve.
  • The preview model is a true base model. It hasn't been aesthetic tuned on a curated dataset. The default style is very plain and neutral, which is especially apparent if you don't use artist or quality tags.

License

This model is licensed under the CircleStone Labs Non-Commercial License. The model and derivatives are only usable for non-commercial purposes. Additionally, this model constitutes a "Derivative Model" of Cosmos-Predict2-2B-Text2Image, and therefore is subject to the NVIDIA Open Model License Agreement insofar as it applies to Derivative Models.

The details of the commercial licensing process are still being worked out. For now, you can express your interest in acquiring a commercial license by emailing tdrussell1@proton.me

Built on NVIDIA Cosmos.

Author: silveroxides

Likes: 2

Downloads: 20

Tags: diffusion-single-file, comfyui, base_model:circlestone-labs/Anima, base_model:quantized:circlestone-labs/Anima, license:other, region:us

Yevrey921/novaAnimeXL_ilV160


license: creativeml-openrail-m

https://civitai.com/models/376130?modelVersionId=2648201

Author: Yevrey921

Likes: 1

Downloads: 0

Tags: license:creativeml-openrail-m, region:us

simonko912/SLiNeP-max


license: apache-2.0
datasets:
  • tomg-group-umd/wikipedia-en-2k-samples
  • BASF-AI/WikipediaEasy10Classification
language:
  • en

SLiNeP

SLiNeP (Super Low Parameter Wikipedia-based Neural Predictor) is a very small GPT-2-based model trained on 11.6 MB of Wikipedia data.

Each topic starts with Topic Name : Topic Text and ends with <|endmsg|>. This repository hosts the largest ("max") variant. Training is expected to complete by February 14th.

Author: simonko912

Likes: 1

Downloads: 0

Tags: en, dataset:tomg-group-umd/wikipedia-en-2k-samples, dataset:BASF-AI/WikipediaEasy10Classification, license:apache-2.0, region:us

mradermacher/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy-GGUF


base_model: MuXodious/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy
language:
  • en
library_name: transformers
license: mit
license_link: https://huggingface.co/zai-org/GLM-4.7-Flash/blob/main/LICENSE
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  • glm
  • MOE
  • pruning
  • compression
  • heretic
  • uncensored
  • decensored
  • abliterated

About


static quants of https://huggingface.co/MuXodious/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants seem not to be available (by me) at this time. If they do not show up a week or so after the static ones, I have probably not planned for them. Feel free to request them by opening a Community Discussion.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 8.6 | |
| GGUF | Q3_K_M | 11.2 | lower quality |
| GGUF | Q4_K_S | 13.2 | fast, recommended |
| GGUF | Q6_K | 19.0 | very good quality |
| GGUF | Q8_0 | 24.6 | fast, best quality |
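
As one way to try a quant, here is a hedged sketch using huggingface_hub and llama-cpp-python; the GGUF filename below is a guess at the repo's naming convention, so check the file list on the model page for the actual name before running it.

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Hypothetical filename; see the repo's file list for the real quant filenames.
path = hf_hub_download(
    repo_id="mradermacher/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy-GGUF",
    filename="GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy.Q4_K_S.gguf",
)
llm = Llama(model_path=path, n_ctx=4096)
out = llm("Write a haiku about pruned mixture-of-experts models.", max_tokens=64)
print(out["choices"][0]["text"])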

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

(quant-quality comparison graph not reproduced in this digest)

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 1

Downloads: 0

Tags: transformers, gguf, glm, MOE, pruning, compression, heretic, uncensored, decensored, abliterated, en, base_model:MuXodious/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy, base_model:quantized:MuXodious/GLM-4.7-Flash-REAP-23B-A3B-absolute-heresy, license:mit, endpoints_compatible, region:us, conversational

syvai/plapre-compact


language:
  • da
tags:
  • text-to-speech
  • tts
  • danish
  • kanade
license: apache-2.0
pipeline_tag: text-to-speech

Plapre Compact - Danish Text-to-Speech

A compact Danish TTS model based on SmolLM2-360M with a stripped vocabulary optimized for speech synthesis.

Architecture

  • Base model: SmolLM2-360M-Instruct (transformer layers)
  • Vocab: 12,885 tokens (compact - only phoneme + audio tokens, no unused text tokens)
  • Parameters: 327M
  • Audio tokenizer: Kanade (frothywater/kanade-25hz-clean) - 12,800 codebook, 25 tokens/sec, 24kHz
  • Phonemizer: espeak-ng (Danish)
  • Speaker embedding: 128-dim global embedding from Kanade (supports voice cloning)

How It Works

Text -> espeak-ng phonemizer -> phoneme tokens -> model generates audio tokens -> Kanade vocoder -> waveform

The model uses a simple sequence format:

<phonemes><phone_h><phone_ɑ><phone_j></phonemes><audio><audio_100><audio_200>...</audio><eos>

Usage

Install dependencies

pip install torch transformers phonemizer soundfile
pip install git+https://github.com/frothywater/kanade-tokenizer

You also need espeak-ng installed:

# Ubuntu/Debian
apt-get install espeak-ng

# macOS
brew install espeak-ng

Quick start

import torch
import soundfile as sf
from phonemizer import phonemize
from transformers import AutoModelForCausalLM, AutoTokenizer
from kanade_tokenizer import KanadeModel, load_vocoder, vocode

AUDIO_TOKEN_START = 85
AUDIO_TOKEN_END = 12884

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
tokenizer = AutoTokenizer.from_pretrained("syvai/plapre-compact")
model = AutoModelForCausalLM.from_pretrained(
    "syvai/plapre-compact", torch_dtype=torch.bfloat16
).to(device).eval()

# Load Kanade vocoder
kanade = KanadeModel.from_pretrained("frothywater/kanade-25hz-clean").eval().to(device)
vocoder = load_vocoder(kanade.config.vocoder_name).to(device)

# Phonemize Danish text
text = "Hej, hvordan har du det?"
phones = phonemize(
    [text], language="da", backend="espeak",
    strip=True, preserve_punctuation=True, language_switch="remove-flags"
)[0].strip()

# Build prompt tokens
phone_ids = [
    tokenizer.convert_tokens_to_ids(f"<phone_{c}>")
    for c in phones
    if tokenizer.convert_tokens_to_ids(f"<phone_{c}>") != tokenizer.unk_token_id
]
ps = tokenizer.convert_tokens_to_ids("<phonemes>")
pe = tokenizer.convert_tokens_to_ids("</phonemes>")
a_s = tokenizer.convert_tokens_to_ids("<audio>")
a_e = tokenizer.convert_tokens_to_ids("</audio>")

ids = [ps] + phone_ids + [pe, a_s]
input_ids = torch.tensor([ids], device=device)

# Generate audio tokens
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids, max_new_tokens=500, do_sample=True,
        temperature=0.8, top_p=0.95, top_k=50,
        eos_token_id=[a_e, tokenizer.eos_token_id],
        pad_token_id=tokenizer.eos_token_id,
    )

# Extract and decode
gen = out[0].tolist()[len(ids):]
kanade_idx = [t - AUDIO_TOKEN_START for t in gen if AUDIO_TOKEN_START <= t <= AUDIO_TOKEN_END]

global_emb = torch.zeros(128, dtype=torch.float32, device=device)  # neutral voice
tokens_t = torch.tensor(kanade_idx, dtype=torch.long, device=device)

with torch.no_grad():
    mel = kanade.decode(content_token_indices=tokens_t, global_embedding=global_emb)
    waveform = vocode(vocoder, mel.unsqueeze(0))

sf.write("output.wav", waveform.squeeze().cpu().numpy(), 24000)

Voice cloning

Clone a voice by extracting a speaker embedding from a reference wav:

import torchaudio

wav, sr = torchaudio.load("reference_speaker.wav")
if sr != 24000:
    wav = torchaudio.functional.resample(wav, sr, 24000)

with torch.no_grad():
    features = kanade.encode(wav.to(device))
global_emb = features.global_embedding  # use this instead of zeros

CLI inference

# Basic usage
python inference_tts.py --text "Hej, hvordan har du det?"

# With voice cloning
python inference_tts.py --text "Hej verden" --speaker-wav reference.wav --output cloned.wav

# Adjust generation parameters
python inference_tts.py --text "Godmorgen" --temperature 0.7 --top-k 30

Training Details

  • Dataset: CoRal-TTS + CoRal-v2 read_aloud (208,244 samples, 339 hours, 538 speakers)
  • Tokenization: Kanade 25Hz audio tokens + espeak-ng Danish phonemes
  • Training: 5 epochs, LR 5e-3, cosine schedule, 20% warmup, effective batch size 32
  • Best checkpoint: epoch 4, val_loss 4.07, audio_accuracy 3.3%

Token Layout

| ID Range | Count | Description |
|----------|-------|-------------|
| 0 | 1 | <pad> |
| 1 | 1 | <eos> |
| 2 | 1 | <unk> |
| 3-6 | 4 | Separators: <phonemes>, </phonemes>, <audio>, </audio> |
| 7-84 | 78 | Phoneme tokens (<phone_X>) |
| 85-12884 | 12,800 | Audio tokens (<audio_0> to <audio_12799>) |

Limitations

  • Danish language only
  • Requires espeak-ng for phonemization
  • Audio quality depends on the Kanade vocoder
  • Short utterances (< 1s) may be less stable

Author: syvai

Likes: 1

Downloads: 0

Tags: safetensors, llama, text-to-speech, tts, danish, kanade, da, license:apache-2.0, region:us

HyzeAI/HyzeMini


license: apache-2.0

Author: HyzeAI

Likes: 1

Downloads: 0

Tags: safetensors, llama, license:apache-2.0, region:us

dpateldev/northstarcore


license: mit

Author: dpateldev

Likes: 1

Downloads: 0

Tags: gguf, license:mit, endpoints_compatible, region:us, conversational

Igg0nk/tomato-disease-cnn

Author: Igg0nk

Likes: 1

Downloads: 0

Tags: pytorch, onnx, image-classification, region:us