Today's AI Summary

AI Insights: Pruning for Efficiency, Enhanced Reasoning, and OCR Advancements

Today's AI landscape showcases advancements in model efficiency, reasoning capabilities, and optical character recognition (OCR). Research papers explore methods to improve LLM reasoning and world model learning, while new models focus on compression, video generation, and document understanding.

Research Highlights

  • Semantic World Models: A paper introduces a novel approach to world modeling for robotics, framing it as a visual question answering problem. This allows vision-language models to be trained as "semantic" world models, enabling policy improvement on open-ended robotics tasks.
  • Scaf-GRPO: Researchers present Scaf-GRPO, a progressive training framework that strategically provides minimal guidance to LLMs when independent learning plateaus. Experiments on mathematics benchmarks demonstrate its effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline.
  • PROBE: A new benchmark, PROBE, is introduced to evaluate proactive problem-solving in LLM agents. It decomposes proactivity into searching for issues, identifying bottlenecks, and executing resolutions. Results show that even state-of-the-art models struggle with this benchmark, highlighting limitations in autonomous action.
  • AdaSPEC: A novel method, AdaSPEC, incorporates selective token filtering into the knowledge distillation process for speculative decoding. By filtering out difficult-to-fit tokens during distillation, the draft model aligns more closely with the target model on the simpler tokens, improving the overall token acceptance rate (a minimal sketch follows this list).
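A minimal PyTorch sketch of the selective-filtering idea behind AdaSPEC: score each token by draft/target disagreement and distill only on the easier ones. The keep-ratio rule and KL-based difficulty score are illustrative assumptions, not the paper's exact criterion.

import torch
import torch.nn.functional as F

def filtered_kd_loss(draft_logits, target_logits, keep_ratio=0.8):
    """Distill the draft model only on tokens it can plausibly fit.

    draft_logits, target_logits: (seq_len, vocab) tensors from the draft and target models.
    Tokens with the highest per-token KL (hardest to fit) are dropped from the loss.
    """
    per_token_kl = F.kl_div(
        F.log_softmax(draft_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="none",
    ).sum(-1)                                         # (seq_len,)
    k = max(1, int(keep_ratio * per_token_kl.numel()))
    keep = torch.topk(-per_token_kl, k).indices       # indices of the easiest tokens
    return per_token_kl[keep].mean()

# Toy usage with random logits standing in for real model outputs.
loss = filtered_kd_loss(torch.randn(16, 32000), torch.randn(16, 32000))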

Model Releases

  • cerebras/GLM-4.6-REAP-218B-A32B-FP8: Cerebras introduces a memory-efficient compressed variant of GLM-4.6-FP8, produced with Router-weighted Expert Activation Pruning (REAP); a sketch of the pruning idea follows this list. The model maintains near-identical performance while being 40% lighter, making it suitable for resource-constrained environments, and achieves strong results on coding benchmarks such as HumanEval and MBPP.
  • spooknik/CyberRealistic-Flux-SVDQ: This repository offers Nunchaku-quantized versions of CyberRealistic Flux, a text-to-image model. The quantization methods, SVDQ INT4 and SVDQ NVFP4, provide options for users with different GPU architectures.
  • artificialguybr/FollowCam-Redmond-WAN2-T2V-14B: A LoRA model designed for Wan-AI models, enabling the creation of immersive, continuous "follow camera" shots in text-to-video generation.
  • lightonai/LightOnOCR-0.9B-32k-1025: LightOnOCR-1B is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs.
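A rough sketch of router-weighted expert pruning as described for REAP above: score each expert by how often and how strongly the router selects it on calibration data, then drop the lowest-scoring experts. The scoring rule and calibration setup here are illustrative assumptions, not Cerebras' exact recipe.

import torch

def score_experts(router_probs, expert_out_norms):
    """Router-weighted expert activation score.

    router_probs:      (num_tokens, num_experts) softmax router weights
    expert_out_norms:  (num_tokens, num_experts) L2 norm of each expert's output per token
                       (zero where the expert was not activated)
    Returns a per-expert saliency: sum over tokens of router weight * activation norm.
    """
    return (router_probs * expert_out_norms).sum(dim=0)

def experts_to_prune(scores, keep_fraction=0.6):
    """Return indices of the lowest-scoring experts to remove."""
    num_experts = scores.numel()
    num_prune = num_experts - int(keep_fraction * num_experts)
    return torch.topk(scores, num_prune, largest=False).indices

# Toy example: 8 experts scored over 1024 calibration tokens.
probs = torch.rand(1024, 8).softmax(dim=-1)
norms = torch.rand(1024, 8)
print(experts_to_prune(score_experts(probs, norms)))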

Key Takeaways

  • Pruning for Efficiency: The release of GLM-4.6-REAP-218B-A32B-FP8 demonstrates the effectiveness of pruning techniques like REAP in compressing large language models without significant performance loss.
  • Reasoning Enhancements: The Scaf-GRPO paper highlights the importance of strategic guidance in improving LLM reasoning capabilities, particularly for complex tasks.
  • OCR Advancements: LightOnOCR-0.9B-32k-1025 showcases the potential of compact vision-language models for efficient and accurate document understanding.
  • Video Generation: The FollowCam LoRA model introduces a novel approach to text-to-video generation, creating immersive and dynamic "follow camera" perspectives.

AI Papers for 2026-03-06

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model capacity alone cannot resolve. We implement the framework as a 3-track architecture (Knowledge, Behavior, Skills) that uses a knowledge graph substrate to stabilize execution by externalizing domain facts and enforcing executable protocols, complemented by a self-learning cycle for autonomous knowledge growth. Applying this to the FutureShorelines WebGIS tool, a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. Results demonstrated a 51% reduction in cyclomatic complexity and a 7-point increase in maintainability index. A comparative experiment against a zero-shot LLM confirms that externalized governance, not just model capability, drives operational reliability in geospatial engineering. This approach is implemented in the open-source AgentLoom governance toolkit.

ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially-Aligned Jacobian Regularization (AAJR), a trajectory-aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step-size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner-loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.
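A minimal PyTorch sketch of the general idea of trajectory-aligned Jacobian control: penalize the directional derivative of a scalar-valued head only along the adversarial ascent direction, rather than bounding the full Jacobian norm. The hypothetical critic head, the way the ascent direction is obtained, and the squared penalty are illustrative assumptions, not the paper's exact AAJR formulation.

import torch
import torch.nn as nn

def aligned_jacobian_penalty(value_fn, x, adv_direction, eps=1e-6):
    """Penalize sensitivity of a scalar-valued head only along the adversarial ascent direction.

    value_fn:      differentiable module mapping (batch, dim) states to (batch, 1) values
    x:             (batch, dim) input states
    adv_direction: (batch, dim) adversarial ascent direction (e.g. sign of the
                   inner-maximization gradient); normalized here
    """
    x = x.detach().requires_grad_(True)
    v = adv_direction / (adv_direction.norm(dim=-1, keepdim=True) + eps)
    out = value_fn(x).sum()                                  # scalar so autograd.grad is simple
    grad_x = torch.autograd.grad(out, x, create_graph=True)[0]
    directional = (grad_x * v).sum(dim=-1)                   # <∇f(x), v> per sample
    return (directional ** 2).mean()                         # differentiable penalty term

# Toy usage with a random critic standing in for a policy/value head.
critic = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
states = torch.randn(4, 8)
ascent_dir = torch.sign(torch.randn(4, 8))
aligned_jacobian_penalty(critic, states, ascent_dir).backward()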

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.

Low-Resource Guidance for Controllable Latent Audio Diffusion

Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost-per-step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and roughly 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
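A minimal sketch of the core efficiency idea described above: guidance computed from a lightweight head that operates on the latent directly, so gradients never flow through the audio decoder. The control head, target, and step size are hypothetical placeholders, not the paper's LatCH architecture.

import torch
import torch.nn as nn

class LatentControlHead(nn.Module):
    """Tiny predictor mapping a diffusion latent to a control attribute (e.g. pitch)."""
    def __init__(self, latent_dim, control_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.GELU(),
                                 nn.Linear(256, control_dim))

    def forward(self, z):
        return self.net(z)

def guided_latent_step(z, head, target, guidance_scale=1.0):
    """Nudge the latent toward a control target without touching the decoder.

    z:      (batch, latent_dim) current diffusion latent
    head:   LatentControlHead predicting the control attribute from z
    target: (batch, control_dim) desired control value
    """
    z = z.detach().requires_grad_(True)
    loss = ((head(z) - target) ** 2).mean()
    grad = torch.autograd.grad(loss, z)[0]     # cheap: no decoder backpropagation
    return (z - guidance_scale * grad).detach()

# Toy usage on random latents.
head = LatentControlHead(latent_dim=64, control_dim=1)
z = guided_latent_step(torch.randn(2, 64), head, target=torch.ones(2, 1))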

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.

Dissecting Quantization Error: A Concentration-Alignment Perspective

Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-preserving transforms (e.g. rotations, Hadamard transform, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This reveals an actionable insight: beyond concentration - the focus of most prior transforms (e.g. rotations or Hadamard) - improving alignment between weight and activation can further reduce quantization error. Motivated by this, we introduce block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision, confirming the insights gained in our framework.
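As a concrete reference for the SQNR quantity discussed above, a small NumPy sketch that applies symmetric uniform integer quantization to a weight matrix and measures the resulting signal-to-quantization-noise ratio; the per-tensor scaling rule is a common textbook choice, not necessarily the paper's setup.

import numpy as np

def quantize_uniform(w, bits=4):
    """Symmetric per-tensor uniform integer quantization, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for int4
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def sqnr_db(w, w_hat):
    """Signal-to-quantization-noise ratio in dB."""
    noise = w - w_hat
    return 10 * np.log10((w ** 2).sum() / (noise ** 2).sum())

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)
w += (rng.random(w.shape) < 1e-4) * 20        # a few outliers hurt concentration
print(f"SQNR at 4 bits: {sqnr_db(w, quantize_uniform(w, bits=4)):.2f} dB")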

RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data -- making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.

Efficient Refusal Ablation in LLM through Optimal Transport

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
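For reference, the closed-form optimal transport map between two Gaussians that the abstract alludes to can be written down directly. A small NumPy sketch, using toy statistics rather than real model activations:

import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu1, cov1, mu2, cov2):
    """Closed-form OT map T(x) = mu2 + A (x - mu1) between N(mu1, cov1) and N(mu2, cov2),
    with A = cov1^{-1/2} (cov1^{1/2} cov2 cov1^{1/2})^{1/2} cov1^{-1/2}."""
    c1_half = np.real(sqrtm(cov1))
    c1_half_inv = np.linalg.inv(c1_half)
    middle = np.real(sqrtm(c1_half @ cov2 @ c1_half))
    A = c1_half_inv @ middle @ c1_half_inv
    return lambda x: mu2 + (x - mu1) @ A.T

# Toy 2-D example standing in for (PCA-reduced) harmful vs. harmless activations.
rng = np.random.default_rng(0)
harmful = rng.normal([1.0, -1.0], 0.5, size=(500, 2))
harmless = rng.normal([0.0, 0.5], 1.0, size=(500, 2))
T = gaussian_ot_map(harmful.mean(0), np.cov(harmful.T),
                    harmless.mean(0), np.cov(harmless.T))
moved = T(harmful)
print(moved.mean(0), np.cov(moved.T))   # ≈ harmless mean and covariance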

RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation

Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heterogeneity of Whole Slide Images (WSIs). Existing pathology report generation frameworks typically employ transformer architectures, relying on a homogeneous decoder architecture and static knowledge retrieval integration. Such architectures limit generative specialization and may introduce noisy external guidance during the report generation process. To address these limitations, we propose RANGER, a sparsely-gated Mixture-of-Experts (MoE) framework with adaptive retrieval re-ranking for pathology report generation. Specifically, we integrate a sparsely gated MoE into the decoder, along with noisy top-$k$ routing and load-balancing regularization, to enable dynamic expert specialization across various diagnostic patterns. Additionally, we introduce an adaptive retrieval re-ranking module that selectively refines retrieved memory from a knowledge base before integration, reducing noise and improving semantic alignment based on visual feature representations. We perform extensive experiments on the PathText-BRCA dataset and demonstrate consistent improvements over existing approaches across standard natural language generation metrics. Our full RANGER model achieves optimal performance on PathText dataset, reaching BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883, and ROUGE-L of 0.3038, validating the effectiveness of dynamic expert routing and adaptive knowledge refinement for semantically grounded pathology report generation.
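A compact PyTorch sketch of the noisy top-k routing mechanism the abstract attributes to the decoder's MoE layer, following the standard sparsely-gated MoE formulation; dimensions and the load-balancing term are illustrative, not RANGER's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Standard noisy top-k gating: route each token to k experts."""
    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.w_noise = nn.Linear(dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        clean = self.w_gate(x)
        noise = torch.randn_like(clean) * F.softplus(self.w_noise(x))
        logits = clean + noise                      # noisy logits encourage exploration
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        gates = torch.zeros_like(logits).scatter(-1, topk_idx, topk_val.softmax(-1))
        # Simple load-balancing regularizer: penalize uneven expert usage.
        importance = gates.sum(dim=0)
        load_balance = importance.var() / (importance.mean() ** 2 + 1e-9)
        return gates, load_balance

gate = NoisyTopKGate(dim=512, num_experts=8, k=2)
gates, aux_loss = gate(torch.randn(16, 512))
print(gates.shape, aux_loss.item())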

AI Models

Kijai/LTX2.3_comfy


license: other (ltx-2-community-license-agreement, https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
tags: comfyui, diffusion-single-file

Separated LTX2.3 checkpoints offering an alternative way to load the models in ComfyUI.


The FP8 quantizations were done with basic static weight scales and are set not to run with FP8 matmuls; the models marked input_scaled additionally have activation scaling and are set to run with FP8 matmuls on supported hardware.

Author: Kijai

Likes: 33

Downloads: 0

Tags: diffusion-single-file, comfyui, license:other, region:us

QuantStack/LTX-2.3-GGUF


language: en, de, es, fr, ja, ko, zh, it, pt
library_name: gguf
license: other (ltx-2-community-license-agreement, https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
pipeline_tag: image-to-video
arxiv: 2601.03233
base_model: Lightricks/LTX-2.3
tags: image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v

This GGUF file is a direct conversion of Lightricks/LTX-2.3

| Type | Name | Location | Download |
|------|------|----------|----------|
| Main Model | LTX-2.3 | ComfyUI/models/unet | GGUF (this repo) |
| Text Encoder | Gemma 3 | ComfyUI/models/text_encoders | Safetensors / GGUF |
| Text Encoder | Text Projection | ComfyUI/models/text_encoders | Safetensors |
| VAE Video / VAE Audio | vae | ComfyUI/models/vae | Safetensors |

Since this is a quantized model, all original licensing terms and usage restrictions remain in effect.

Author: QuantStack

Likes: 8

Downloads: 0

Tags: gguf, image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks, en, de, es, fr, ja, ko, zh, it, pt, base_model:Lightricks/LTX-2.3, base_model:quantized:Lightricks/LTX-2.3, license:other, region:us

dx8152/Flux2-Klein-9B-Consistency


license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-9B
pipeline_tag: image-to-image
tags: lora

This LoRA can significantly improve Klein's consistency without requiring any trigger words. Video tutorial: https://youtu.be/JXMbbbdfnSg


Author: dx8152

Likes: 7

Downloads: 0

Tags: lora, image-to-image, base_model:black-forest-labs/FLUX.2-klein-9B, base_model:adapter:black-forest-labs/FLUX.2-klein-9B, license:apache-2.0, region:us

fal/virtual-tryoff-lora


tags: flux.2, image-to-image, virtual-try-off, fal, lora, diffusers, template:diffusion-lora
base_model: black-forest-labs/FLUX.2-klein-9B
instance_prompt: TRYOFF
license: apache-2.0
pipeline_tag: image-to-image

Widget example prompts:

  • images/5.png: TRYOFF extract the t-shirt over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/3.png: TRYOFF extract the dress over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/2.png: TRYOFF extract the pants over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/6.png: TRYOFF extract the outfit over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/1.png: TRYOFF extract the t-shirt over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/4.png: TRYOFF extract the upper body over a white background, product photography style. NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin).
  • images/7.png: TRYOFF extract full outfit in the reference image over a white background, high-end professional product photography. Present the outfit as a complete, vertically stacked ensemble arranged as if worn. The top-layer garment is dominant, followed directly by the bottom-layer garment. The footwear is placed below the bottom-layer hem, aligning with where the feet would naturally be. Lighting: clean, even, diffused studio lighting (softbox or beauty dish style). The illumination must highlight all varying textures (e.g., pebble leather, suede, knit, or canvas) without creating harsh shadows.

FLUX.2-klein-base-9B Virtual Try-Off LoRA


Virtual Try-Off: Given an image of a person wearing clothing and a garment category prompt, the model generates a clean image of the garment as if it were photographed alone.
The model reconstructs the clothing item while preserving its style, texture, color, and design from the input image.
1 input image (person wearing clothes) + text category → 1 output garment image
Built with fal.ai.

Usage
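The card's Usage section was empty in this snapshot. Below is a minimal, untested sketch of how such a LoRA is typically loaded with diffusers, assuming the base FLUX.2-klein-9B repository exposes a standard diffusers pipeline with LoRA support (the pipeline class and the image argument name may differ).

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Assumption: the base model ships a standard diffusers pipeline with load_lora_weights support.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("fal/virtual-tryoff-lora")

person = load_image("person_wearing_tshirt.jpg")   # hypothetical input photo
prompt = (
    "TRYOFF extract the t-shirt over a white background, product photography style. "
    "NO HUMAN VISIBLE (the garments maintain their 3D form like an invisible mannequin)."
)
# The exact keyword for the conditioning image depends on the pipeline class.
garment = pipe(prompt=prompt, image=person).images[0]
garment.save("garment.png")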

Training

Trained with fal.ai trainer.

  • Base model: FLUX.2-klein-base-9B
  • Steps: 10000
  • Learning Rate: 0.00005
  • Dataset: 300 image pairs (model + garment) of shape 1024x1024

Author

Created by Riza Velioglu at fal.ai

Author: fal

Likes: 6

Downloads: 0

Tags: diffusers, flux.2, image-to-image, virtual-try-off, fal, lora, template:diffusion-lora, base_model:black-forest-labs/FLUX.2-klein-9B, base_model:adapter:black-forest-labs/FLUX.2-klein-9B, license:apache-2.0, region:us

RuneXX/LTX-2.3-Workflows


tags:

  • ltx
  • ltx-2
  • comfyui
  • comfy
  • gguf
  • ltx-video
  • ltx-2-3
  • ltxv
  • text-to-video
  • image-to-video
  • audio-to-video
  • video-to-video

The workflows are based on the extracted models from https://huggingface.co/Kijai/LTX2.3_comfy. The extracted models (as separate files) may run more easily on your computer, but you can easily swap out the model loader for the ComfyUI default model loader if you want to load the all-in-one checkpoint with the VAE built in.

Model Downloads:

  • Main split models used in these workflows (LTX-2.3 dev & distilled safetensor, embeddings, audio and video vae): https://huggingface.co/Kijai/LTX2.3_comfy

  • Gemma 3 12B it safetensor: https://huggingface.co/Comfy-Org/ltx-2/

  • Gemma 3 12B it GGUF: https://huggingface.co/unsloth/gemma-3-12b-it-GGUF/

  • Optional LTX-2.3 GGUF models (for GGUF workflows):

  1. Quantstack: https://huggingface.co/QuantStack/LTX-2.3-GGUF
  2. Vantage: https://huggingface.co/vantagewithai/LTX-2.3-GGUF

Needed nodes:

  • https://github.com/kijai/ComfyUI-KJNodes (NB! Must be up to date for LTX-2 support)

  • https://github.com/city96/ComfyUI-GGUF (NB! Must be up to date for LTX-2 support)

  • ComfyUI itself must be updated to the very latest version


Lightricks LTX-2.3 main repo: https://huggingface.co/Lightricks/LTX-2.3
Lightricks LTX-2.3 Collection (loras etc): https://huggingface.co/collections/Lightricks/ltx-23

Author: RuneXX

Likes: 5

Downloads: 0

Tags: ltx, ltx-2, comfyui, comfy, gguf, ltx-video, ltx-2-3, ltxv, text-to-video, image-to-video, audio-to-video, video-to-video, region:us

KORMo-VL/KORMo-VL-Diffusion


library_name: diffusers
license: apache-2.0


🚀 Update News

  • 2026-03-05: Official release of KORMo-Diffusion.
  • 2026-03-02: Official release of KORMo-VL.
  • 2025-10-13: Official release of KORMo-10B-sft.

💡 About KORMo-VL-Diffusion

KORMo-VL is a vision-language model developed from scratch by the KAIST MLP Lab (https://sites.google.com/view/aailab), built on top of KORMo-10B. The system consists of two components:

  • Vision-Language Model (VLM)
  • Image Generation Model

The KORMo-VL-Diffusion model, designed for image generation, was trained from scratch with a high proportion of images reflecting Korean daily environments and culture. Unfortunately, due to limited GPU resources during the research process, we are sharing the intermediate results of the model at this stage.



  • LLM: KORMo-VL
  • Model Structure: Re-developed with reference to the Qwen-Image architecture (the roughly 20B diffusion component was modified and trained from scratch)
  • Languages: Korean / English
  • Training Data: Synthetic data + public datasets (e.g., AI Hub, details to be released)

If an environment that allows sufficient training becomes available in the future, we aim to develop this into a completed model. Anyone who would like to perform additional tuning or research on top of this intermediate result is welcome to use it freely.

📈 T2I Performance

English Prompt

| Prompt | Generated Image |
| :--- | :--- |
| Dense forest | <img src="https://huggingface.co/KORMo-VL/KORMo-VL-Diffusion/resolve/main/example_images/Dense%20forest.webp" width="400"> |
| Black pattern mug | <img src="https://huggingface.co/KORMo-VL/KORMo-VL-Diffusion/resolve/main/example_images/black%20pattern%20mug%20cpup.webp" width="400"> |

Korean Prompt

| Prompt | Generated Image |
| :--- | :--- |
| 울창한 숲 (Dense forest) | <img src="https://huggingface.co/KORMo-VL/KORMo-VL-Diffusion/resolve/main/example_images/Dense%20forest.webp" width="400"> |
| 검은 무늬의 머그컵 (Black pattern mug) | <img src="https://huggingface.co/KORMo-VL/KORMo-VL-Diffusion/resolve/main/example_images/%EA%B2%80%EC%9D%80%20%EB%AC%B4%EB%8A%AC%EC%9D%98%20%EB%A8%B8%EA%B7%B8%EC%BB%B5.webp" width="400"> |

KORMo-VL-Diffusion Demo

prompt: 아름다운 정원의 꽃들 (Flowers in a beautiful garden)

Demo video: https://huggingface.co/KORMo-VL/KORMo-VL-Diffusion/resolve/main/kormo_diffusion_assets/kormo_t2i.mp4

📦 Installation

uv pip install transformers==4.57.1 pillow torchvision diffusers

🚀 Inference Example

An inference example will be provided via the GitHub repo.
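In the meantime, a minimal, untested sketch of what inference might look like, assuming the repository's custom pipeline (diffusers:KormoImagePipeline per its tags) loads through diffusers' generic DiffusionPipeline with remote code enabled; the call signature is a placeholder.

import torch
from diffusers import DiffusionPipeline

# Assumption: the repo ships a custom diffusers pipeline loadable with trust_remote_code.
pipe = DiffusionPipeline.from_pretrained(
    "KORMo-VL/KORMo-VL-Diffusion",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Korean prompt: "Flowers in a beautiful garden"
image = pipe(prompt="아름다운 정원의 꽃들").images[0]
image.save("kormo_sample.png")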

Contact

  • KyungTae Lim, Professor at KAIST. ktlim@kaist.ac.kr

Contributors (https://sites.google.com/view/aailab)

  • Junghun Yuk
  • Inho Won
  • Hangyeol Yoo
  • Junmyeong Lee
  • KyungTae Lim

Citation

@misc{KORMo,
  author = {Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junghun Yuk, Changsu Choi, Dongjae Shin, Huije Lee, Hoyun Song, Alice Oh, and KyungTae Lim},
  title = {KORMo: Korean Open Reasoning Model for Everyone},
  year = {2025},
  publisher = {GitHub},
  journal = {Technical Report},
  paperLink = {\url{https://arxiv.org/abs/2510.09426}}
}

Author: KORMo-VL

Likes: 5

Downloads: 0

Tags: diffusers, safetensors, arxiv:2510.09426, license:apache-2.0, diffusers:KormoImagePipeline, region:us

Lightricks/LTX-2.3-fp8


language: en, de, es, fr, ja, ko, zh, it, pt
library_name: diffusers
license: other (ltx-2-community-license-agreement, https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
pipeline_tag: image-to-video
arxiv: 2601.03233
tags: image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v

LTX-2.3 FP8 Model Card

These are the FP8 versions of the LTX-2.3 model. All information below is derived from the base model.

This model card focuses on the LTX-2.3 model, which is a significant update to the LTX-2 model with improved audio and visual quality as well as enhanced prompt adherence. LTX-2 was presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model.

💻💻 If you want to dive right into the code, it is available at https://github.com/Lightricks/LTX-2. 💾💾

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

LTX-2.3 Open Source

Model Checkpoints

| Name | Notes |
|------|-------|
| ltx-2.3-22b-dev-fp8 | The full model, flexible and trainable in bf16 |
| ltx-2.3-22b-distilled-fp8 (coming soon) | The distilled version of the full model, 8 steps, CFG=1 |

Model Details

  • Developed by: Lightricks
  • Model type: Diffusion-based audio-video foundation model
  • Language(s): English

Online demo

LTX-2.3 is accessible right away via the API Playground.

Run locally

Direct use license

You can use the models (full, distilled, upscalers, and any derivatives of the models) for purposes permitted under the license.

ComfyUI

We recommend you use the built-in LTXVideo nodes that can be found in the ComfyUI Manager. For manual installation information, please refer to our documentation site.

PyTorch codebase

The LTX-2 codebase is a monorepo with several packages, from model definitions in 'ltx-core' to pipelines in 'ltx-pipelines' and training capabilities in 'ltx-trainer'. The codebase was tested with Python >= 3.12 and CUDA > 12.7, and supports PyTorch ~= 2.7.

Installation

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate

Inference

To use our model, please follow the instructions in our ltx-pipelines package.

Diffusers 🧨

LTX-2.3 support in the Diffusers Python library is coming soon!

General tips:

  • Width and height must be divisible by 32, and the frame count must be of the form 8n + 1 (a multiple of 8, plus 1), e.g. 97 or 121 frames.
  • If the resolution or frame count does not satisfy these constraints, the input should be padded with -1 and then cropped to the desired resolution and frame count (see the helper sketch after this list).
  • For tips on writing effective prompts, please visit our Prompting guide
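A tiny helper, as referenced above, that rounds a requested size up to the nearest dimensions satisfying these constraints; the rounding-up policy is an assumption (rounding down would work equally well).

def _round_up(v: int, m: int) -> int:
    return ((v + m - 1) // m) * m

def valid_ltx_dims(width: int, height: int, num_frames: int):
    """Round up to accepted dimensions: width/height divisible by 32, frame count of the form 8n + 1."""
    return _round_up(width, 32), _round_up(height, 32), _round_up(num_frames - 1, 8) + 1

print(valid_ltx_dims(1270, 715, 96))   # -> (1280, 736, 97)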

Limitations

  • This model is not intended or able to provide factual information.
  • As a statistical model this checkpoint might amplify existing societal biases.
  • The model may fail to generate videos that match the prompts perfectly.
  • Prompt following is heavily influenced by the prompting style.
  • The model may generate content that is inappropriate or offensive.
  • When generating audio without speech, the audio may be of lower quality.

Train the model

Currently it is recommended to train the bf16 model. Recipes for training the fp8 model are welcome as community contributions.

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Guy Shiran and Itay Chachy and Jonathan Chetboun and Michael Finkelson and Michael Kupchick and Nir Zabari and Nitzan Guetta and Noa Kotler and Ofir Bibi and Ori Gordon and Poriya Panet and Roi Benita and Shahar Armon and Victor Kulikov and Yaron Inger and Yonatan Shiftan and Zeev Melumian and Zeev Farbman},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}

Author: Lightricks

Likes: 4

Downloads: 0

Tags: diffusers, image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks, en, de, es, fr, ja, ko, zh, it, pt, arxiv:2601.03233, license:other, region:us

HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive


license: apache-2.0
tags: uncensored, qwen3.5, qwen, gguf
language: en, zh, multilingual

Qwen3.5-27B-Uncensored-HauhauCS-Aggressive

Qwen3.5-27B uncensored by HauhauCS.

About

0/465 refusals. Fully uncensored with zero capability loss.

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended - just without the refusals.

These are meant to be the best lossless uncensored models out there.

Aggressive Variant

Stronger uncensoring with more thorough refusal removal. If this variant is too loose for your use case, a Balanced variant may follow.

Note: The model is fully unlocked and will not refuse prompts. However, it may occasionally append a short disclaimer at the end of a response (e.g. "This is general information, not legal advice..."). This is baked into the base model's training and not a refusal — the actual content is still generated in full.

Downloads

| File | Quant | Size |
|------|-------|------|
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-BF16.gguf | BF16 | 51 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf | Q8_0 | 27 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q6_K.gguf | Q6_K | 21 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q5_K_M.gguf | Q5_K_M | 19 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | Q4_K_M | 16 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ4_XS.gguf | IQ4_XS | 14 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q3_K_M.gguf | Q3_K_M | 13 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf | IQ3_M | 12 GB |
| Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ2_M.gguf | IQ2_M | 8.8 GB |
| mmproj-Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-f16.gguf | Vision encoder | 885 MB |

IQ quants (IQ2_M, IQ3_M, IQ4_XS) were generated with importance matrix calibration for better quality at low bit rates.

Vision support: This model is natively multimodal. The mmproj file is the vision encoder — you need it alongside the main GGUF to use image/video inputs. Load both files in llama.cpp, LM Studio, or any compatible runtime.
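A hedged example of serving the model with vision support via llama.cpp's llama-server; the flags exist in recent builds, the file names come from the table above, and the quant choice and context size (128K, per the note below) are arbitrary.

llama-server \
  -m Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  -c 131072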

Specs

  • 27B dense parameters, 64 layers
  • Hybrid architecture: Gated DeltaNet linear attention + full softmax attention (3:1 ratio)
  • 262K native context (extendable to 1M with YaRN)
  • Natively multimodal (text, image, video)
  • Multi-token prediction (MTP) support
  • 248K vocabulary, 201 languages
  • Based on Qwen3.5-27B

Recommended Settings

From the official Qwen authors:

Thinking mode (default):

  • temperature=0.6, top_p=0.95, top_k=20, min_p=0

Non-thinking mode:

  • temperature=0.7, top_p=0.8, top_k=20, min_p=0

Important:

  • Maintain at least 128K context to preserve thinking capabilities
  • For production/high-throughput: use vLLM, SGLang, or KTransformers
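For llama.cpp front-ends, the thinking-mode settings above map directly onto sampler flags, e.g. (a hedged example; adjust the model path, quant, and context size to your setup):

llama-cli -m Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  -c 131072 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0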

Note: This is a brand new architecture (released 2026-03-02). llama.cpp support landed very recently — make sure you're on a recent build. Works with llama.cpp, LM Studio, Jan, koboldcpp, etc.

Also check out the 9B variant, the 4B variant, and all releases at HauhauCS.

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, etc.

Author: HauhauCS

Likes: 4

Downloads: 0

Tags: gguf, uncensored, qwen3.5, qwen, en, zh, multilingual, license:apache-2.0, endpoints_compatible, region:us, conversational

drbaph/LTX-2.3-FP8


language: en, de, es, fr, ja, ko, zh, it, pt
library_name: diffusers
license: other (ltx-2-community-license-agreement, https://github.com/Lightricks/LTX-2/blob/main/LICENSE)
pipeline_tag: image-to-video
arxiv: 2601.03233
tags: image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v

LTX-2.3 FP8 Quantized

FP8 quantized versions of the LTX-2.3 22B models by Lightricks.

LTX-2 Open Source

Quantized Checkpoints

| Name | Original | Size |
|------|----------|------|
| ltx-2.3-22b-dev-fp8_mixed.safetensors | ltx-2.3-22b-dev | ~30 GB |
| ltx-2.3-22b-distilled-fp8_mixed.safetensors | ltx-2.3-22b-distilled | ~30 GB |

Quantization Details

  • Format: float8_e4m3fn (E4M3, max=448)
  • Method: Static per-tensor W8A8 quantization
  • Scope: Transformer blocks 1–42 (block 0 and last 5 blocks kept in BF16)
  • Targets: All linear projection weight matrices in attn1, attn2, audio_attn1, audio_attn2, audio_to_video_attn, video_to_audio_attn, ff.net, audio_ff.net — specifically to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2 and their audio equivalents
  • Scale: Per-tensor weight_scale = max(|W|) / 448 stored as F32 scalar alongside each weight. Static input_scale = 1.0 placeholder matching the source model format
  • Non-quantized: Biases, norms, scale_shift_tables, gate_logits kept as BF16/F32
  • Quantized tensors: 1176 / 5947 total (28 patterns × 42 blocks)
  • Output size: ~29.94 GB (down from ~46 GB BF16)
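A small PyTorch sketch of the static per-tensor weight quantization scheme described above (weight_scale = max(|W|) / 448, stored as float8_e4m3fn with an F32 scale); it mirrors the stated recipe but is an illustrative reimplementation, not the script used to produce these files.

import torch

F8_MAX = 448.0   # largest representable magnitude in float8_e4m3fn

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Static per-tensor FP8 quantization: returns (fp8 weight, f32 scale)."""
    scale = weight.abs().max().float() / F8_MAX
    q = (weight.float() / scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    """Recover an approximate high-precision weight for matmul on hardware without FP8 kernels."""
    return (q.float() * scale).to(dtype)

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
q, scale = quantize_fp8_per_tensor(w)
w_hat = dequantize_fp8(q, scale)
print(scale.item(), (w.float() - w_hat.float()).abs().max().item())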

Original Model

This is a quantized derivative of Lightricks/LTX-2.3. All original model details, usage instructions, and license terms apply.

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model.

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Guy Shiran and Itay Chachy and Jonathan Chetboun and Michael Finkelson and Michael Kupchick and Nir Zabari and Nitzan Guetta and Noa Kotler and Ofir Bibi and Ori Gordon and Poriya Panet and Roi Benita and Shahar Armon and Victor Kulikov and Yaron Inger and Yonatan Shiftan and Zeev Melumian and Zeev Farbman},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}

Author: drbaph

Likes: 4

Downloads: 0

Tags: diffusers, image-to-video, text-to-video, video-to-video, image-text-to-video, audio-to-video, text-to-audio, video-to-audio, audio-to-audio, text-to-audio-video, image-to-audio-video, image-text-to-audio-video, ltx-2, ltx-2-3, ltx-video, ltxv, lightricks, en, de, es, fr, ja, ko, zh, it, pt, arxiv:2601.03233, license:other, region:us

Aratako/Semantic-DACVAE-Japanese


license: mit
language: ja
base_model: facebook/dacvae-watermarked
pipeline_tag: audio-to-audio
tags: VAE, Speech, Voice

Semantic-DACVAE-Japanese

Semantic-DACVAE-Japanese is a fine-tuned version of facebook/dacvae-watermarked, specifically optimized for Japanese speech.

By incorporating WavLM semantic distillation inspired by the Semantic-VAE paper and performing additional training on Japanese datasets, this model achieves more natural reconstructions and higher subjective quality for Japanese audio compared to the original base model.

Furthermore, according to the Semantic-VAE paper, this semantic distillation approach should also improve the training efficiency and performance of downstream TTS models.

🌟 Overview

📊 Evaluation

We evaluated the model using the UTMOSv2 metric to measure the quality of the reconstructed audio. Our model demonstrates a clear improvement in naturalness over the original base model across both tested datasets.

1. Emilia-YODAS (Japanese Subset)

Tested on 100 samples (not included in training) from the Japanese subset of amphion/Emilia-Dataset.

| Audio Source | Mean UTMOSv2 (n=100) |
| :--- | :---: |
| Original Audio | 2.2099 |
| facebook/dacvae-watermarked | 2.2841 |
| Aratako/Semantic-DACVAE-Japanese | 2.4812 |

2. Private Test Dataset (Japanese)

Tested on 100 private Japanese speech samples.

| Audio Source | Mean UTMOSv2 (n=100) |
| :--- | :---: |
| Original Audio | 2.0322 |
| facebook/dacvae-watermarked | 1.8775 |
| Aratako/Semantic-DACVAE-Japanese | 2.1629 |

🚀 Quick Start

Installation

First, set up your environment and install the official repository:

# Create a virtual environment
uv venv --python=3.10

# Install the official dacvae package
uv pip install git+https://github.com/facebookresearch/dacvae

Inference

Below is a basic example of inference.

import soundfile as sf
import torch
import torchaudio
from audiotools import AudioSignal
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

# 1. Load the model
model = DACVAE.load(hf_hub_download(repo_id="Aratako/Semantic-DACVAE-Japanese", filename="weights.pth")).eval()

# Disable/bypass the default watermark since this model was fine-tuned without it
model.decoder.alpha = 0.0
model.decoder.watermark = lambda x, message=None, d=model.decoder: d.wm_model.encoder_block.forward_no_conv(x)

# 2. Load and preprocess audio
wav_np, sr = sf.read("input.wav", dtype="float32")
wav = torch.from_numpy(wav_np.T) if wav_np.ndim == 2 else torch.from_numpy(wav_np).unsqueeze(0)
wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, model.sample_rate)

signal = AudioSignal(wav.unsqueeze(0), model.sample_rate)
signal.normalize(-16.0)
signal.ensure_max_of_audio()
x = signal.audio_data.float()  # (1, 1, T)

# 3. Encode and Decode
with torch.no_grad():
    z = model.encoder(model._pad(x))
    z, _ = model.quantizer.in_proj(z).chunk(2, dim=1)
    y = model.decode(z)[0].cpu()

# 4. Save reconstructed audio
sf.write("recon.wav", y.squeeze(0).numpy(), model.sample_rate)

📜 Acknowledgements

🖊️ Citation

@misc{semantic-dacvae-japanese,
  author = {Chihiro Arata},
  title = {Semantic-DACVAE-Japanese: Audio VAE for Japanese Speech},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Semantic-DACVAE-Japanese}}
}

Author: Aratako

Likes: 4

Downloads: 0

Tags: VAE, Speech, Voice, audio-to-audio, ja, arxiv:2509.22167, base_model:facebook/dacvae-watermarked, base_model:finetune:facebook/dacvae-watermarked, license:mit, region:us