Today's AI Summary

AI Developments: New Models Enhance Image Generation and Reasoning, Research Focuses on Efficient Learning

Here's a look at the latest AI models and research papers:

Research Papers:

Several interesting research papers have emerged, focusing on improving the efficiency and capabilities of AI in various domains:

  • EfficientFlow: This paper introduces a new framework for embodied AI that uses flow-based policy learning. It enhances data efficiency by incorporating equivariance into flow matching and accelerates sampling with a novel regularization strategy.
  • Diffusion Model Framework for Maximum Entropy Reinforcement Learning: This paper reinterprets Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem, leading to improved returns and higher sample efficiency compared to existing methods.
  • Visual Sync: This paper presents an optimization framework for multi-camera synchronization, aligning unsynchronized videos with millisecond accuracy by minimizing epipolar error based on cross-view object motion.
  • Learning Sim-to-Real Humanoid Locomotion in 15 Minutes: This paper introduces a recipe based on off-policy RL algorithms that enables rapid training of humanoid locomotion policies in just 15 minutes.
  • RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies: This paper introduces a method to mitigate covariate shift in autonomous driving policies by leveraging the policy's own closed-loop rollouts as additional training data.
  • LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess: This paper introduces an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess.
  • Forecasting in Offline Reinforcement Learning for Non-stationary Environments: This paper introduces a framework that unifies conditional diffusion-based candidate state generation and zero-shot time-series foundation models to improve performance in non-stationary environments.
  • Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback: This paper presents a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement.
  • AI-Driven Optimization under Uncertainty for Mineral Processing Operations: This paper introduces an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP) to optimize mineral processing circuits under uncertainty.
  • From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning: This paper investigates how RL contributes to reasoning capabilities through the lens of Complementary Reasoning.

Models:

Several new AI models have been released, showcasing advancements in image generation, instruction following, and reasoning:

  • alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union: This ControlNet model, trained from scratch on a large dataset of high-quality images, supports multiple control conditions like Canny, HED, Depth, and Pose. It allows for adjusting control context scale for stronger control and detail preservation. Likes: 128
  • apple/CLaRa-7B-Instruct: This instruction-tuned model with built-in semantic document compression supports instruction-following QA directly from compressed document representations. It demonstrates strong instruction-following performance under 16x compression. Likes: 35
  • unsloth/Ministral-3-14B-Instruct-2512-GGUF: This model offers capabilities and performance comparable to larger models. It supports vision, multilingual capabilities, system prompts, agentic capabilities, and is optimized for edge deployment. It achieves strong performance in reasoning and instruction-following benchmarks. Likes: 11

Key Takeaways:

  • Efficiency is a major focus: Research is heavily geared towards improving the efficiency of AI models, both in terms of data requirements and computational resources.
  • Reinforcement learning advancements: Several papers explore new techniques in reinforcement learning, particularly for robotics and autonomous systems, with a focus on handling uncertainty and non-stationary environments.
  • Multimodal models are gaining traction: Models like Ministral-3-14B-Instruct-2512 demonstrate the increasing capabilities of multimodal models that can process both text and images, opening up new possibilities for AI applications.
  • Reasoning and instruction following remain challenging: While progress is being made, benchmarks like LLM CHESS highlight the ongoing challenges in achieving robust reasoning and instruction-following abilities in large language models.

AI Papers for 2026-02-14

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
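To make the verification recipe concrete, here is a minimal Python sketch of the idea: rephrase the instruction several ways, sample candidate actions for each phrasing, and let a verifier pick the best-aligned pair. The function names (rephrase, propose_actions, verifier) are placeholders of mine, not the paper's released API.

import random

def verify_and_act(instruction, rephrase, propose_actions, verifier,
                   n_rephrases=4, n_actions=8):
    # Jointly scale over rephrased instructions and sampled actions,
    # then let the verifier select the best (prompt, action) pair.
    prompts = [instruction] + [rephrase(instruction) for _ in range(n_rephrases - 1)]
    candidates = [(p, a) for p in prompts for a in propose_actions(p, n_actions)]
    return max(candidates, key=lambda pa: verifier(*pa))

# Tiny stand-ins so the sketch runs end to end.
rephrase = lambda s: s + " (rephrased)"
propose_actions = lambda prompt, n: [[random.random() for _ in range(3)] for _ in range(n)]
verifier = lambda prompt, action: -abs(sum(action) - 1.5)  # pretend alignment score
print(verify_and_act("pick up the red block", rephrase, propose_actions, verifier))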

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

AttentionRetriever: Attention Layers are Secretly Long Document Retrievers

Retrieval-augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long documents and fail to address several key challenges of long-document retrieval, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long-document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. In extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long-document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.

Agentic Test-Time Scaling for WebAgents

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
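The vote-uncertainty signal is simple to compute; below is a hedged Python sketch of the allocation rule, with thresholds and sample counts chosen for illustration rather than taken from the paper.

import math
import random
from collections import Counter

def vote_uncertainty(votes):
    # Entropy and top-1/top-2 margin of the agent's per-step vote distribution.
    counts = Counter(votes)
    probs = sorted((c / len(votes) for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs)
    margin = probs[0] - (probs[1] if len(probs) > 1 else 0.0)
    return entropy, margin

def confidence_aware_step(sample_action, k_base=3, k_max=9, margin_threshold=0.4):
    # Escalate sampling only when the initial votes are genuinely contentious.
    votes = [sample_action() for _ in range(k_base)]
    _, margin = vote_uncertainty(votes)
    if margin < margin_threshold:
        votes += [sample_action() for _ in range(k_max - k_base)]
    return Counter(votes).most_common(1)[0][0]

print(confidence_aware_step(lambda: random.choice(["click(login)", "click(search)"])))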

Creative Ownership in the Age of AI

Copyright law focuses on whether a new work is "substantially similar" to an existing one, but generative AI can closely imitate style without copying content, a capability now central to ongoing litigation. We argue that existing definitions of infringement are ill-suited to this setting and propose a new criterion: a generative AI output infringes on an existing work if it could not have been generated without that work in its training corpus. To operationalize this definition, we model generative systems as closure operators mapping a corpus of existing works to an output of new works. AI-generated outputs are "permissible" if they do not infringe on any existing work according to our criterion. Our results characterize structural properties of permissible generation and reveal a sharp asymptotic dichotomy: when the process of organic creations is light-tailed, dependence on individual works eventually vanishes, so that regulation imposes no limits on AI generation; with heavy-tailed creations, regulation can be persistently constraining.

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
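As a rough illustration of the checklist-reward idea (not CM2's released implementation), a turn's reward can be the fraction of fine-grained binary criteria that an LLM judge marks as satisfied; the judge interface and the equal weighting below are my assumptions.

def checklist_reward(criteria, judge, turn_transcript):
    # Each criterion is a binary check; the reward is the fraction satisfied.
    verdicts = [bool(judge(turn_transcript, c)) for c in criteria]
    return sum(verdicts) / max(len(verdicts), 1)

criteria = [
    "The agent asked for the order ID before calling the refund tool.",
    "The refund amount matches the amount stated by the user.",
    "The final reply confirms the refund and the processing time.",
]
# Toy keyword judge standing in for an LLM-based classifier.
toy_judge = lambda transcript, criterion: criterion.split()[1].lower() in transcript.lower()
print(checklist_reward(criteria, toy_judge, "agent asked for the order id, then issued the refund"))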

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
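As a sketch of how physics priors can narrow a symbolic regression search (the operator choices below are illustrative, not KeplerAgent's actual configuration), an agent that infers a periodic dependence on the launch angle might restrict PySR to a small trigonometric library before fitting:

# pip install pysr
import numpy as np
from pysr import PySRRegressor

# Toy data: projectile range as a function of speed v and launch angle theta.
rng = np.random.default_rng(0)
v = rng.uniform(1.0, 10.0, 200)
theta = rng.uniform(0.1, 1.4, 200)
X = np.column_stack([v, theta])
y = v**2 * np.sin(2 * theta) / 9.81

# Restricted operator library chosen from an (assumed) inferred prior:
# the target is rational in v and periodic in theta.
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*", "/"],
    unary_operators=["sin"],
    maxsize=20,
)
model.fit(X, y)
print(model.get_best())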

On the implicit regularization of Langevin dynamics with projected noise

We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new light on the effects of symmetry on stochastic gradient descent for over-parametrized models. Our main result identifies a novel form of implicit regularization: when the initial and target density are both invariant under the group action, Langevin dynamics with projected noise is equivalent in law to Langevin dynamics with isotropic diffusion but with an additional drift term proportional to the negative log volume of the group orbit. We prove this result by constructing a coupling of the two processes via a third process on the group itself, and identify the additional drift as the mean curvature of the orbits.
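Schematically, the claimed equivalence can be written as follows (a LaTeX paraphrase of the abstract; the sign and normalization of the orbit-volume term are my reading, not the paper's exact statement), where P_{X_t} projects the noise onto directions orthogonal to the orbit G·X_t and both the initial and target densities are G-invariant:

\begin{aligned}
\text{Projected noise:}\quad & dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,P_{X_t}\,dB_t,\\
\text{Equal in law to:}\quad & dY_t = -\nabla\Big( V(Y_t) + \log \mathrm{vol}\,(G\cdot Y_t) \Big)\,dt + \sqrt{2}\,dB_t.
\end{aligned}

Up to sign and normalization, the extra drift is the gradient of the log orbit volume, which the paper identifies with the mean curvature of the orbits.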

A technical curriculum on language-oriented artificial intelligence in translation and specialised communication

This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
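The synthetic-data recipe can be approximated with any open-source TTS package; the sketch below uses Coqui TTS with an off-the-shelf English voice as an illustrative stand-in, not the paper's actual pipeline or model choice.

# pip install TTS
from TTS.api import TTS

street_names = ["Guadalupe Street", "Schoenherr Road", "Dubois Avenue"]

# Illustrative model id; the paper's choice of open-source TTS models may differ.
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

for i, name in enumerate(street_names):
    # One carrier sentence per entity; the paper elicits more diverse pronunciations.
    tts.tts_to_file(text=f"Navigate to {name}.", file_path=f"synthetic_{i:03d}.wav")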

AI Models

deepgenteam/DeepGen-1.0


license: apache-2.0
datasets: Alex11556666/Reason_Tuning
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-image

💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Paper: http://arxiv.org/abs/2602.12205 | Code: https://github.com/deepgenteam/deepgen | Project page: https://deepgenteam.github.io/

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
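For intuition, a minimal PyTorch sketch of the Stacked Channel Bridging idea follows: features tapped from several VLM layers are projected, stacked, and fused with learnable think tokens that then condition the DiT. The dimensions, fusion operator, and module names are assumptions for illustration, not the released DeepGen 1.0 code.

import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    # Features from several VLM layers are projected, stacked, and fused with
    # learnable "think tokens" that condition the generative backbone.
    def __init__(self, vlm_dim=2048, dit_dim=1536, n_layers=4, n_think_tokens=16):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(vlm_dim, dit_dim) for _ in range(n_layers)])
        self.think_tokens = nn.Parameter(torch.randn(n_think_tokens, dit_dim) * 0.02)
        self.fuse = nn.MultiheadAttention(dit_dim, num_heads=8, batch_first=True)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: list of [batch, seq, vlm_dim] tensors from chosen VLM layers.
        projected = [p(h) for p, h in zip(self.proj, vlm_hidden_states)]
        stacked = torch.cat(projected, dim=1)
        queries = self.think_tokens.unsqueeze(0).repeat(stacked.size(0), 1, 1)
        guidance, _ = self.fuse(queries, stacked, stacked)
        return guidance  # conditioning sequence handed to the DiT

bridge = StackedChannelBridge()
features = [torch.randn(2, 77, 2048) for _ in range(4)]
print(bridge(features).shape)  # torch.Size([2, 16, 1536])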


📊 Benchmarks

1. General Image Generation

| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --- | --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | — |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | — |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | — | 84.78 | — |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |

🎨 Qualitative results

(See teaser.png in the repository for qualitative examples.)

🛠️ Usage

Merge ZIP Files

To use the DeepGen checkpoints, please merge the sharded model files first. We release pre-training, supervised fine-tuning, and reinforcement learning checkpoints.

# Merge zip
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
# Unzip DeepGen checkpoints 
unzip DeepGen_CKPT.zip

Author: deepgenteam

Likes: 22

Downloads: 0

Tags: image-to-image, dataset:Alex11556666/Reason_Tuning, arxiv:2602.12205, base_model:Qwen/Qwen2.5-VL-3B-Instruct, base_model:finetune:Qwen/Qwen2.5-VL-3B-Instruct, license:apache-2.0, region:us

ubergarm/MiniMax-M2.5-GGUF


quantized_by: ubergarm
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
tags: imatrix, conversational, minimax_m2, ik_llama.cpp

ik_llama.cpp imatrix Quantizations of MiniMaxAI/MiniMax-M2.5

NOTE ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCPP, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus, which have been built against CUDA 12.8.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to huggingface for hosting all these big quants!

Finally, I really appreciate the support from aifoundry.org so check out their open source RISC-V based solutions!

Quant Collection

Perplexity computed against wiki.test.raw. (lower is "better")

Perplexity Chart

These two are just test quants for baseline perplexity comparison and are not available for download here:

  • BF16 426.060 GiB (16.003 BPW)
    • PPL over 552 chunks for n_ctx=512 = 8.3386 +/- 0.06651
  • Q8_0 226.431 GiB (8.505 BPW)
    • PPL over 552 chunks for n_ctx=512 = 8.3590 +/- 0.06673

NOTE: The first split file is intentionally much smaller because it only contains metadata; it's fine!

IQ5_K 157.771 GiB (5.926 BPW)

PPL over 552 chunks for n_ctx=512 = 8.4860 +/- 0.06815

<details> <summary>👈 Secret Recipe</summary>
custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/imatrix-MiniMax-M2.5-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ5_K.gguf \
    IQ5_K \
    128
</details>

IQ4_XS 114.842 GiB (4.314 BPW)

PPL over 552 chunks for n_ctx=512 = 8.5702 +/- 0.06901

This is the only quant in this collection that is compatible with mainline llama.cpp. ik_llama.cpp can run all of them.

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq4_xs
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_xs

# Non-Repeating Layers
token_embd\.weight=q4_K
output\.weight=q6_K
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/imatrix-MiniMax-M2.5-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ4_XS.gguf \
    IQ4_XS \
    128
</details>

smol-IQ3_KS 87.237 GiB (3.277 BPW)

PPL over 552 chunks for n_ctx=512 = 8.7539 +/- 0.07075

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/imatrix-MiniMax-M2.5-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-smol-IQ3_KS.gguf \
    IQ3_KS \
    128
</details>

IQ2_KS 69.800 GiB (2.622 BPW)

PPL over 552 chunks for n_ctx=512 = 9.6827 +/- 0.07972

<details> <summary>👈 Secret Recipe</summary>
#!/usr/bin/env bash

custom="
# 61 Repeating Layers [0-61]

# Attention [0-61] GPU
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# Routed Experts Layers [0-61] CPU
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/imatrix-MiniMax-M2.5-BF16.dat \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf \
    /mnt/data/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ2_KS.gguf \
    IQ2_KS \
    128
</details>

Quick Start

# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)

# Hybrid CPU and Single GPU
echo TODO or look at my Step-3.5-Flash for rough example for now using --cpu-moe or --n-cpu-moe XX etc

# Hybrid CPU and Multi GPU 128k context full offload in 96GB VRAM
model=MiniMax-M2.5-IQ2_KS-00001-of-00003.gguf
_GLIBCXX_REGEX_STATE_LIMIT=1000000 \
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
    --model "$model" \
    --alias ubergarm/MiniMax-M2.5 \
    -khad -ctk q6_0 -ctv q8_0 \
    -c 131072 \
    -ger \
    -sm graph \
    -ngl 99 \
    -ub 4096 -b 4096 \
    -ts 47,48 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

# CPU-Only
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/MiniMax-M2.5 \
    --ctx-size 65536 \
    -ger \
    --merge-qkv \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

My own early testing with opencode suggests that even the smol-IQ3_KS is working okay with tool calling etc!

For tool use you can always bring your own template with --chat-template-file myTemplate.jinja and might need --special etc.


Author: ubergarm

Likes: 14

Downloads: 0

Tags: gguf, imatrix, conversational, minimax_m2, ik_llama.cpp, text-generation, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, endpoints_compatible, region:us

ox-ox/MiniMax-M2.5-GGUF


license: mit
base_model: MiniMaxAI/MiniMax-M2.5
model_creator: MiniMaxAI
pipeline_tag: text-generation
tags: gguf, moe, minimax, llama.cpp, applesilicon, reasoning, conversational

MiniMax-M2.5-GGUF (230B MoE)

High-precision GGUF quants of the MiniMax-M2.5 (230B parameters) Mixture of Experts model. These versions are specifically optimized for local inference on high-RAM setups, particularly Apple Silicon (M3 Max/Ultra).

🔬 Perplexity Validation (WikiText-2):

Final PPL: 8.2213 +/- 0.09

Context: 4096 / 32 chunks

Outcome: The Q3_K_L quantization maintains high logical coherence while boosting speed to 28.7 t/s. Minimal degradation for a ~20GB size reduction vs Q4.

🚀 Available Quants

| File Name | Method | Size | Use Case |
| :--- | :--- | :--- | :--- |
| minimax-m2.5-Q4_K_M.gguf | Q4_K_M | 138 GB | Highest logic preservation. Requires >128GB RAM or SSD swap. |
| minimax-m2.5-Q3_K_L.gguf | Q3_K_L | ~110 GB | Sweet spot for 128GB Macs. Runs natively in RAM at high speed (~28 t/s on a Mac M3 Max). |

🛠 Model Details

  • Architecture: MiniMax-M2 (Mixture of Experts) with 256 experts (8 active per token).
  • Parameters: ~230B total.
  • Quantization Process: Unlike automated scripts, these quants were generated from a full F16 GGUF Master (457GB) to minimize accumulation of errors during the K-Quant process.
  • Context Window: Up to 196k tokens (Native support).
  • Chat Template: Includes the official Jinja template for proper handling of interleaved <think> tags, separating reasoning from the final response.
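Because the template interleaves <think> blocks with the answer, client code typically needs to separate the two. A minimal Python sketch (assuming the tags appear verbatim in the decoded text) is:

import re

def split_reasoning(text):
    # Collect <think>...</think> blocks and strip them from the visible answer.
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

thoughts, answer = split_reasoning(
    "<think>The user greets me; reply briefly.</think>Hello! How can I help?"
)
print(thoughts)  # ['The user greets me; reply briefly.']
print(answer)    # Hello! How can I help?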

💻 Usage

Requires llama.cpp build 8022 or higher.

Command Line Example:

./llama-cli -m minimax-m2.5-Q3_K_L.gguf -n -1 \
  -c 262000 \
  -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -b 2048 -ub 1024 --port 8080 --jinja --verbose -sm none --draft 16 -ncmoe 0 --cache-reuse 1024 --draft-p-min 0.5

Author: ox-ox

Likes: 9

Downloads: 0

Tags: gguf, moe, minimax, llama.cpp, applesilicon, reasoning, conversational, text-generation, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:mit, endpoints_compatible, region:us

DevQuasar/MiniMaxAI.MiniMax-M2.5-GGUF


base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation


'Make knowledge free for everyone'

Quantized version of: MiniMaxAI/MiniMax-M2.5

Buy Me a Coffee at ko-fi.com: https://ko-fi.com/L4L416YX7C

Author: DevQuasar

Likes: 5

Downloads: 0

Tags: gguf, text-generation, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, endpoints_compatible, region:us, conversational

OccultAI/Morpheus-8B-v1


license: apache-2.0
base_model: SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated
tags: finetune, llama, occult, uncensored
datasets: OccultAI/Morpheus_665
language: en
library_name: transformers
widget: "Morpheus 8B v1" (image: https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/dZ-Q05tEZVM-W3nFIab6U.png)

[!CAUTION] ⚠️ Warning: This model can produce narratives and RP that contain violent and graphic erotic content. Adjust your system prompt accordingly, and use the Llama 3 chat template.

Morpheus 8B v1

Recommended Settings: Temp 1.0, TopNSigma 1.25


{'loss': 0.5651, 'grad_norm': 3.787045478820801, 'learning_rate': 1.0386570913148586e-05, 'entropy': 0.7085917145013809, 'num_tokens': 588832.0, 'mean_token_accuracy': 0.8508399426937103, 'epoch': 4.0}

| Model | Q0 Score | Quant | Q0G | Refusals |
| :--- | :--- | :--- | :--- | :--- |
| Morpheus 8B v1 | 15365 | Q6_K | Pass | 0/100 |

Author: OccultAI

Likes: 4

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, finetune, occult, uncensored, conversational, en, dataset:OccultAI/Morpheus_665, base_model:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, base_model:finetune:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

mlx-community/MiniMax-M2.5-3bit


pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
library_name: mlx
base_model: MiniMaxAI/MiniMax-M2.5
tags: mlx

mlx-community/MiniMax-M2.5-3bit

This model mlx-community/MiniMax-M2.5-3bit was converted to MLX format from MiniMaxAI/MiniMax-M2.5 using mlx-lm version 0.30.7.

Use with mlx

pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/MiniMax-M2.5-3bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Author: mlx-community

Likes: 3

Downloads: 0

Tags: mlx, safetensors, minimax_m2, text-generation, conversational, custom_code, base_model:MiniMaxAI/MiniMax-M2.5, base_model:quantized:MiniMaxAI/MiniMax-M2.5, license:other, 3-bit, region:us

HeartMuLa/HeartMuLa-oss-3B-happy-new-year


license: apache-2.0
language: zh, en, ja, ko, es
pipeline_tag: text-to-audio
tags: music, art

Model Details

The best open-sourced music generation model in terms of lyrics controllability and music quality.

Model Description

  • Developed by: HeartMuLa Team
  • License: Apache 2.0

Links

  • Github Repo: https://github.com/HeartMuLa/heartlib
  • Paper: https://arxiv.org/abs/2601.10547
  • Demo: https://heartmula.github.io/
  • HeartMuLa-oss-3B-happy-new-year (recommended) https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B-happy-new-year
  • HeartMuLa-oss-3B: https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B
  • HeartCodec-oss-20260123 (recommended) https://huggingface.co/HeartMuLa/HeartCodec-oss-20260123
  • HeartCodec-oss: https://huggingface.co/HeartMuLa/HeartCodec-oss
  • HeartTranscriptor-oss: https://huggingface.co/HeartMuLa/HeartTranscriptor-oss

Get Started

Check our github repo https://github.com/HeartMuLa/heartlib for a quickstart and local deployment of HeartMuLa.

Citation


If you find HeartMuLa useful, please cite:

@misc{yang2026heartmulafamilyopensourced,
      title={HeartMuLa: A Family of Open Sourced Music Foundation Models}, 
      author={Dongchao Yang and Yuxin Xie and Yuguo Yin and Zheyu Wang and Xiaoyu Yi and Gongxi Zhu and Xiaolong Weng and Zihan Xiong and Yingzhe Ma and Dading Cong and Jingliang Liu and Zihang Huang and Jinghan Ru and Rongjie Huang and Haoran Wan and Peixu Wang and Kuoxi Yu and Helin Wang and Liming Liang and Xianwei Zhuang and Yuanyuan Wang and Haohan Guo and Junjie Cao and Zeqian Ju and Songxiang Liu and Yuewen Cao and Heming Weng and Yuexian Zou},
      year={2026},
      eprint={2601.10547},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10547}, 
}

Contact

If you are interested in HeartMuLa, feel free to reach us at heartmula.ai@gmail.com

Author: HeartMuLa

Likes: 3

Downloads: 0

Tags: safetensors, heartmula, music, art, text-to-audio, zh, en, ja, ko, es, arxiv:2601.10547, license:apache-2.0, region:us

DarkArtsForge/Raven-8B-v1


license: apache-2.0
base_model: SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated
tags: finetune, llama, raven, poe, gothic, horror, creative writing, RP
datasets: DarkArtsForge/Poe_v1
language: en
library_name: transformers
widget: "Raven 8B v1" (image: https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/gPKk1RgcW0QN0NAVpf4lh.jpeg)

[!CAUTION] ⚠️ Warning: This model can produce narratives and RP that contain violent and graphic erotic content. Adjust your system prompt accordingly, and use the Llama 3 chat template.

Raven 8B v1

A fully uncensored finetune of Llama-3.1-Nemotron-8B trained on a small dataset drawn from the Edgar Allan Poe corpus. Cooked for 5 epochs using PMPF.

{'loss': 0.1136, 'grad_norm': 1.0182174444198608, 'learning_rate': 1.685173482438018e-08, 'entropy': 0.18156841583549976, 'num_tokens': 99475.0, 'mean_token_accuracy': 0.9738506525754929, 'epoch': 5.0}
{'train_runtime': 590.173, 'train_samples_per_second': 0.847, 'train_steps_per_second': 0.212, 'train_loss': 1.036527609705925, 'epoch': 5.0}


Author: DarkArtsForge

Likes: 3

Downloads: 0

Tags: transformers, safetensors, llama, text-generation, finetune, raven, poe, gothic, horror, creative writing, RP, conversational, en, dataset:DarkArtsForge/Poe_v1, base_model:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, base_model:finetune:SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us

acvlab/FantasyWorld-Wan2.1-I2V-14B-480P


license: apache-2.0
language: en, zh
base_model: Wan-AI/Wan2.1-I2V-14B-480P
pipeline_tag: image-to-video
tags: wan, world_model

FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction


🔥🔥🔥 Latest News

  • 👋 Feb, 2026: We release the code and model weights of FantasyWorld.
  • Jan, 2026: FantasyWorld is accepted by ICLR 2026.
  • 🎉 Dec, 2025: FantasyWorld ranked 1st on the WorldScore Leaderboard (by Stanford Prof. Fei-Fei Li's team), validating our approach against global state-of-the-art models.

🌟 Overview


FantasyWorld is a unified feed-forward model for joint video and 3D scene generation. The front end employs Preconditioning Blocks (PCBs) that reuse the frozen WanDiT denoiser to supply partially denoised latents, ensuring the geometry pathway operates on meaningful features rather than pure noise. The backbone then consists of stacked Integrated Reconstruction and Generation (IRG) Blocks, which iteratively refine video latents and geometry features under multimodal conditioning. Each IRG block contains an asymmetric dual-branch structure: an Imagination Prior Branch for appearance synthesis and a Geometry-Consistent Branch for explicit 3D reasoning, coupled through lightweight adapters and cross attention.
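To make the dual-branch coupling easier to picture, here is a heavily simplified PyTorch sketch of one IRG-style block; the layer types, dimensions, and coupling direction are assumptions of this summary, not the released FantasyWorld code.

import torch
import torch.nn as nn

class IRGBlockSketch(nn.Module):
    # Imagination (video) branch and geometry branch, coupled through a
    # lightweight adapter and cross-attention.
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.video_branch = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.geometry_branch = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.adapter = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_latents, geometry_feats):
        video_latents = self.video_branch(video_latents)
        geometry_feats = self.geometry_branch(geometry_feats)
        # Geometry tokens query the adapted appearance features.
        appearance = self.adapter(video_latents)
        coupled, _ = self.cross_attn(geometry_feats, appearance, appearance)
        return video_latents, geometry_feats + coupled

block = IRGBlockSketch()
v, g = torch.randn(1, 64, 1024), torch.randn(1, 64, 1024)
v_out, g_out = block(v, g)
print(v_out.shape, g_out.shape)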

🚀 Training Strategy

FantasyWorld leverages a robust two-stage training strategy to achieve joint video and 3D generation:

  • Stage 1 (Geometry Pre-training): Utilizes a VGGT-style model for precise estimation of depth, point, and camera trajectories.
  • Stage 2 (Joint Generation): A unified model that seamlessly integrates the Stage 1 geometry backbone with the Wan video generation pipeline.

📦 Model Zoo

We provide two versions of the model to cater to different research and application needs:

| Model Name | Description |
| :--- | :--- |
| FantasyWorld-Wan2.1-I2V-14B-480P | Reproducibility Focus: Strictly adheres to the original configurations detailed in our paper. Best for academic benchmarking and reproducing reported results. |
| FantasyWorld-Wan2.2-Fun-A14B-Control-Camera | Performance Focus: Offers substantial enhancements, including an upgraded video foundation model, larger-scale training datasets, and higher output resolution. |

🚀 Quickstart

Installation

  1. Clone the repository:

git clone https://github.com/Fantasy-AMAP/fantasy-world.git
cd fantasy-world

  2. Install dependencies:

conda create -n fantasyworld python=3.10
conda activate fantasyworld
pip install -r requirements.txt
pip install thirdparty/utils3d/

1. FantasyWorld-Wan2.1-I2V-14B-480P

1.1 Model Download

| Models | Download Link | Notes |
| --- | --- | --- |
| Wan2.1-I2V-14B-480P | 🤗 Huggingface 🤖 ModelScope | Base Model |
| FantasyWorld-Wan2.1-I2V-14B-480P | 🤗 Huggingface 🤖 ModelScope | FantasyWorld |

Download models using huggingface:

pip install -U "huggingface_hub"
hf download "Wan-AI/Wan2.1-I2V-14B-480P" --local-dir ./models/Wan-AI/Wan2.1-I2V-14B-480P
hf download "acvlab/FantasyWorld-Wan2.1-I2V-14B-480P" --local-dir ./models/FantasyWorld-Wan2.1-I2V-14B-480P/

Download models using modelscope:

pip install -U modelscope
modelscope download "Wan-AI/Wan2.1-I2V-14B-480P" --local_dir ./models/Wan-AI/Wan2.1-I2V-14B-480P
modelscope download "amap_cvlab/FantasyWorld-Wan2.1-I2V-14B-480P" --local_dir ./models/FantasyWorld-Wan2.1-I2V-14B-480P/

1.2 Inference Command

python inference_wan21.py \
    --wan_ckpt_path ./models/Wan-AI/Wan2.1-I2V-14B-480P \
    --model_ckpt ./models/FantasyWorld-Wan2.1-I2V-14B-480P/model.pth \
    --image_path ./examples/images/input_image.png \
    --camera_json_path ./examples/cameras/camera_data.json \
    --prompt "In the Open Loft Living Room, sunlight streams through large windows, highlighting the sleek fireplace and elegant wooden stairs." \
    --output_dir ./output-wan21 \
    --sample_steps 50 \
    --using_scale True 

Parameter Description:

  • --wan_ckpt_path - Required: Directory containing the Wan model checkpoints
  • --model_ckpt - Required: Path to the trained model checkpoint
  • --image_path - Required: Path to the input image
  • --camera_json_path - Required: Camera json path
  • --prompt - Required: Text prompt
  • --output_dir - Optional: Output directory
  • --sample_steps - Optional: Number of sampling steps (default: 50)
  • --using_scale - Optional: Whether to use scale normalization (default: True)

2. FantasyWorld-Wan2.2-Fun-A14B-Control-Camera

2.1 Model Download

| Models | Download Link | Notes |
| --- | --- | --- |
| Wan2.2-Fun-A14B-Control-Camera | 🤗 Huggingface 🤖 ModelScope | Base Model |
| Wan2.2-Fun-Reward-LoRAs | 🤗 Huggingface 🤖 ModelScope | LoRA Model |
| FantasyWorld-Wan2.2-Fun-A14B-Control-Camera | 🤗 Huggingface 🤖 ModelScope | FantasyWorld |

Download models using huggingface:

pip install -U "huggingface_hub"
hf download "alibaba-pai/Wan2.2-Fun-A14B-Control-Camera" --local-dir ./models/PAI/Wan2.2-Fun-A14B-Control-Camera
hf download "alibaba-pai/Wan2.2-Fun-Reward-LoRAs" --local-dir ./models/PAI/Wan2.2-Fun-Reward-LoRAs
hf download "acvlab/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera" --local-dir ./models/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera/

Download models using modelscope:

pip install -U modelscope
modelscope download "PAI/Wan2.2-Fun-A14B-Control-Camera" --local_dir ./models/PAI/Wan2.2-Fun-A14B-Control-Camera
modelscope download "PAI/Wan2.2-Fun-Reward-LoRAs" --local_dir ./models/PAI/Wan2.2-Fun-Reward-LoRAs
modelscope download "amap_cvlab/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera" --local_dir ./models/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera/

2.2 Inference Command

python inference_wan22.py \
    --image_path ./examples/images/input_image.png \
    --end_image_path ./examples/images/end_image.png \
    --wan_ckpt_path ./models/ \
    --camera_json_path ./examples/cameras/camera_data.json \
    --prompt "In the Open Loft Living Room, sunlight streams through large windows, highlighting the sleek fireplace and elegant wooden stairs." \
    --model_ckpt_high ./models/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera/high_noise_model.pth \
    --model_ckpt_low ./models/FantasyWorld-Wan2.2-Fun-A14B-Control-Camera/low_noise_model.pth \
    --output_dir ./output-wan22 \
    --sample_steps 50 \
    --using_scale True

Parameter Description:

  • --image_path - Required: Path to the first image
  • --end_image_path - Required: Path to the end image
  • --wan_ckpt_path - Required: Directory containing the Wan model checkpoints
  • --camera_json_path - Required: Camera json path, corresponding to examples/cameras/camera_data_*.json
  • --prompt - Required: Text prompt
  • --model_ckpt_high - Required: Path to the trained high model checkpoint
  • --model_ckpt_low - Required: Path to the trained low model checkpoint
  • --output_dir - Optional: Output directory
  • --sample_steps - Optional: Number of sampling steps (default: 50)
  • --using_scale - Optional: Whether to use scale normalization (default: True)

🧩 Community Contributions

We ❤️ contributions from the open-source community! If your work has improved FantasyWorld, please let us know, or e-mail frank.jf@alibaba-inc.com directly. We are happy to reference your project for everyone's convenience.

🔗 Citation

If you find this repository useful, please consider giving a star ⭐ and a citation.

@inproceedings{
    dai2025fantasyworld,
    title={FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction},
    author={Yixiang Dai and Fan Jiang and Chiyu Wang and Mu Xu and Yonggang Qi},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=3q9vHEqsNx}
}

🙏 Acknowledgments

We would like to thank Wan, VideoX-Fun, DiffSynth-Studio and VGGT for their great works.

Author: acvlab

Likes: 3

Downloads: 0

Tags: wan, world_model, image-to-video, en, zh, arxiv:2509.21657, base_model:Wan-AI/Wan2.1-I2V-14B-480P, base_model:finetune:Wan-AI/Wan2.1-I2V-14B-480P, license:apache-2.0, region:us

HauhauCS/GPTOSS-120B-Uncensored-HauhauCS-Aggressive


license: apache-2.0
base_model: openai/gpt-oss-120b
tags: uncensored, abliterated, gguf, mxfp4, moe, gpt-oss
language: en

GPTOSS-120B-Uncensored-HauhauCS-Aggressive

Uncensored version of GPT-OSS 120B by OpenAI. This is the aggressive variant - tuned harder for fewer refusals.

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended - just without the refusals.

Format

MXFP4 GGUF. This is the model's native precision - GPT-OSS was trained in MXFP4, so no further quantization is needed or recommended. Re-quantizing would only lose quality.

Works with llama.cpp, LM Studio, Ollama, and anything else that loads GGUFs.

Downloads

| File | Size |
| --- | --- |
| GPTOSS-120B-Uncensored-HauhauCS-Aggressive-MXFP4.gguf | 61 GB |

Specs

  • 117B total parameters, ~5.1B active per forward pass (MoE: 128 experts, top-4 routing)
  • 128K context
  • Based on openai/gpt-oss-120b

Recommended Settings

  • temperature: 1.0
  • top_k: 40
  • Everything else (top_p, min_p, repeat penalty, etc.) should be disabled; some clients enable these by default, so turn them off

Required flag: --jinja to enable the Harmony response format (the model won't work correctly without it).

For llama.cpp:

llama-server -m model.gguf --jinja -fa -b 2048 -ub 2048
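Once the server is up, any OpenAI-compatible client can pass the recommended sampler settings. A minimal Python example is below, assuming the server above is listening on the default port 8080; the model field is only a label for llama-server.

# pip install requests
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-120b",  # label only; llama-server serves the loaded GGUF
        "messages": [{"role": "user", "content": "Summarize MXFP4 in one sentence."}],
        "temperature": 1.0,  # recommended settings from above
        "top_k": 40,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])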

LM Studio

Compatible with Reasoning Effort custom buttons. To use them, put the model in:

LM Models\lmstudio-community\gpt-oss-120b-GGUF\

Hardware

Fits in ~61GB VRAM. Single H100 or equivalent. For lower VRAM, use --n-cpu-moe N in llama.cpp to offload MoE layers to CPU.

Author: HauhauCS

Likes: 2

Downloads: 0

Tags: gguf, uncensored, abliterated, mxfp4, moe, gpt-oss, en, base_model:openai/gpt-oss-120b, base_model:quantized:openai/gpt-oss-120b, license:apache-2.0, endpoints_compatible, region:us, conversational