Today's AI Summary

AI Developments: Mixture-of-Experts Models, Steerable 3D Generation, and More

Today's AI landscape sees advances across several domains, from language models with innovative architectures to methods for steerable 3D asset generation and for aligning educational resources with learning outcomes.

Noteworthy Research Papers

  • DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders: This paper introduces a model-agnostic decoder framework for video diffusion models, enabling interactive generation previews with multi-modal representations at real-time speeds. It unlocks new control capabilities through interactive guidance at intermediate noise steps.
  • Feedforward 3D Editing via Text-Steerable Image-to-3D: Steer3D, a feedforward method, adds text steerability to image-to-3D models, allowing language-based editing of generated 3D assets. It uses a scalable data engine and a two-stage training recipe, achieving faster performance while maintaining consistency.
  • Embedding-Based Rankings of Educational Resources based on Learning Outcome Alignment: This research proposes a framework for automating the evaluation of alignment between educational resources and intended learning outcomes using LLM-based text-embedding models. The study demonstrates that higher alignment scores correlate with greater learning performance.
  • Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models: This paper introduces Cascade RL, orchestrating sequential, domain-wise RL to develop general-purpose reasoning models. The 14B model outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI).

New Models

  • XiaomiMiMo/MiMo-V2-Flash: This Mixture-of-Experts (MoE) language model features 309B total parameters (15B active). It employs a hybrid attention architecture and Multi-Token Prediction (MTP) for high-speed reasoning and agentic workflows. It achieves strong performance across standard benchmarks and excels in long-context tasks.
  • nvidia/Nemotron-Cascade-8B: This general-purpose model is trained through sequential and domain-wise reinforcement learning. It operates in both thinking and instruct modes, achieving best-in-class performance across various benchmarks.
  • APRIL-AIGC/T3-Video: This model focuses on native 4K video generation, leveraging a Transform Trained Transformer architecture to accelerate the process.
  • MiniMaxAI/VTP-Base-f16d64, MiniMaxAI/VTP-Large-f16d64, MiniMaxAI/VTP-Small-f16d64: These models are part of the Visual Tokenizer Pre-training (VTP) framework, which integrates contrastive, self-supervised, and reconstruction learning for improved image generation.
  • allenai/Molmo2-VideoPoint-4B: This model is part of the Molmo2 family of open vision-language models, focusing on video pointing and counting tasks. It is fine-tuned on the Molmo2-VideoPoint data and demonstrates competitive performance in video counting evaluations.

Key Takeaways

  • Efficient Architectures: Models like MiMo-V2-Flash are pushing the boundaries of efficiency by using Mixture-of-Experts architectures and novel attention mechanisms to reduce inference costs while maintaining high performance.
  • Reinforcement Learning for Reasoning: Nemotron-Cascade highlights the effectiveness of cascaded reinforcement learning in building general-purpose reasoning models.
  • Scalable Pre-training: The VTP framework addresses the "pre-training scaling problem" by emphasizing high-level semantics in visual tokenizers, leading to better generative performance with increased compute.
  • Multimodal Advancements: Molmo2-VideoPoint-4B showcases the progress in video understanding and grounding, with a focus on specific tasks like video pointing and counting.
  • Text-Steerable 3D Generation: Steer3D demonstrates the potential of adding text steerability to image-to-3D models, enabling intuitive editing of generated 3D assets.

AI Papers for 2026-02-06

Protein Autoregressive Modeling via Multiscale Structure Generation

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures the way a sculptor shapes a statue, forming a coarse topology and then refining structural details across scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

Contrastive Continual Learning for Model Adaptability in Internet of Things

Internet of Things (IoT) deployments operate in nonstationary, dynamic environments where factors such as sensor drift, evolving user behavior, and heterogeneous user privacy requirements can affect application utility. Continual learning (CL) addresses this by adapting models over time without catastrophic forgetting. Meanwhile, contrastive learning has emerged as a powerful representation-learning paradigm that improves robustness and sample efficiency in a self-supervised manner. This paper reviews the use of *contrastive continual learning* (CCL) for IoT, connecting algorithmic design (replay, regularization, distillation, prompts) with IoT system realities (TinyML constraints, intermittent connectivity, privacy). We present a unifying problem formulation, derive common objectives that blend contrastive and distillation losses, propose an IoT-oriented reference architecture for on-device, edge, and cloud-based CCL, and provide guidance on evaluation protocols and metrics. Finally, we highlight open challenges unique to the IoT domain, such as handling tabular and streaming IoT data, concept drift, federated settings, and energy-aware training.

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a large memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
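As a rough illustration of the idea (a generic sketch, not the paper's estimator), the per-token total-variation distance between the new and old policies can be lower-bounded using only the Top-K most likely tokens, and that quantity penalized in place of ratio clipping:

import torch

def topk_tv_divergence(logits_new: torch.Tensor, logits_old: torch.Tensor, k: int = 32) -> torch.Tensor:
    # logits_*: (batch, seq_len, vocab). Restricting the sum to the K most likely
    # tokens under the old policy keeps memory negligible for large vocabularies.
    p_new = torch.softmax(logits_new, dim=-1)
    p_old = torch.softmax(logits_old, dim=-1)
    top = torch.topk(p_old, k, dim=-1).indices
    # Lower bound on TV = 0.5 * sum_v |p_new(v) - p_old(v)|
    return 0.5 * (p_new.gather(-1, top) - p_old.gather(-1, top)).abs().sum(-1)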

Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.

CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation

Continual reinforcement learning (CRL) requires agents to learn from a sequence of tasks without forgetting previously acquired policies. In this work, we introduce a novel benchmark suite for CRL based on realistically simulated robots in the Gazebo simulator. Our Continual Robotic Simulation Suite (CRoSS) benchmarks rely on two robotic platforms: a two-wheeled differential-drive robot with lidar, camera, and bumper sensors, and a robotic arm with seven joints. The former represents an agent in line-following and object-pushing scenarios, where variation of visual and structural parameters yields a large number of distinct tasks, whereas the latter is used in two goal-reaching scenarios with high-level Cartesian hand-position control (modeled after the Continual World benchmark) and low-level control based on joint angles. For the robotic arm benchmarks, we provide additional kinematics-only variants that bypass the need for physical simulation (as long as no sensor readings are required) and can be run two orders of magnitude faster. CRoSS is designed to be easily extensible and enables controlled studies of continual reinforcement learning in robotic settings with high physical realism, and in particular allows the use of almost arbitrary simulated sensors. To ensure reproducibility and ease of use, we provide a containerized setup (Apptainer) that runs out of the box, and report the performance of standard RL algorithms, including Deep Q-Networks (DQN) and policy gradient methods. This highlights CRoSS's suitability as a scalable and reproducible benchmark for CRL research.

Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks.

El Agente Quntur: A research collaborator agent for quantum chemistry

Quantum chemistry is a foundational enabling tool for the fields of chemistry, materials science, computational biology and others. Despite its power, the practical application of quantum chemistry simulations remains in the hands of qualified experts due to methodological complexity, software heterogeneity, and the need for informed interpretation of results. To bridge the accessibility gap for these tools and expand their reach to chemists with broader backgrounds, we introduce El Agente Quntur, a hierarchical, multi-agent AI system designed to operate not merely as an automation tool but as a research collaborator for computational quantum chemistry. Quntur was designed following three main strategies: i) elimination of hard-coded procedural policies in favour of reasoning-driven decisions, ii) construction of general and composable actions that facilitate generalization and efficiency, and iii) implementation of guided deep research to integrate abstract quantum-chemical reasoning across subdisciplines and a detailed understanding of the software's internal logic and syntax. Although instantiated in ORCA, these design principles are applicable to research agents more generally and readily extensible to additional quantum chemistry packages and beyond. Quntur supports the full range of calculations available in ORCA 6.0 and reasons over software documentation and scientific literature to plan, execute, adapt, and analyze in silico chemistry experiments following best practices. We discuss the advances and current bottlenecks in agentic systems operating at the research level in computational chemistry, and outline a roadmap toward a fully autonomous end-to-end computational chemistry research agent.

El Agente Estructural: An Artificially Intelligent Molecular Editor

We present El Agente Estructural, a multimodal, natural-language-driven geometry-generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three-dimensional structures.

Fluid Representations in Reasoning Models

Reasoning language models, which generate long chains of thought, dramatically outperform non-reasoning language models on abstract problems. However, the internal model mechanisms that allow this superior performance remain poorly understood. We present a mechanistic analysis of how QwQ-32B - a model specifically trained to produce extensive reasoning traces - processes abstract structural information. On Mystery Blocksworld - a semantically obfuscated planning domain - we find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning. The model develops abstract encodings that focus on structure rather than specific action names. Through steering experiments, we establish causal evidence that these adaptations improve problem solving: injecting refined representations from successful traces boosts accuracy, while symbolic representations can replace many obfuscated encodings with minimal performance loss. We find that one of the factors driving reasoning model performance is in-context refinement of token representations, which we dub Fluid Reasoning Representations.

AI Models

SicariusSicariiStuff/Impish_Bloodmoon_12B_Abliterated


license: apache-2.0
language:
  - en
widget:
  - text: Impish_Bloodmoon_12B_Abliterated
    output:
      url: https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B_Abliterated/resolve/main/Images/Abliterated.png
base_model:
  - SicariusSicariiStuff/Impish_Bloodmoon_12B

<div align="center"> <b style="font-size: 50px;">Impish_Bloodmoon_12B</b> </div> <div align="center"> <b style="font-size: 80px;">Abliterated</b> </div>
<div align="center" style="font-size: 18px; margin-top: 20px;"> <b>Developed by:</b> <a href="https://huggingface.co/SicariusSicariiStuff">SicariusSicariiStuff</a> </div>

Impish_Bloodmoon_12B_Abliterated is an abliterated variant of SicariusSicariiStuff/Impish_Bloodmoon_12B with surgical removal of refusal mechanisms. This model maintains the full capabilities of the original, while eliminating safety guardrails through orthogonalization techniques.

| Metric | Value |
|:---|:---|
| KL divergence | <0.02 |
| Refusals | ~3% |

What is KL divergence?

Think of it as a way to measure how far the abliterated model's "world model" has drifted from the original's; the lower the KL divergence, the closer the two models' "world models" are to each other.

If the original model thinks making pineapple pizza is a crime against humanity (it is), then the abliterated model will still hold to this belief, but if asked how to make one (probably after giving you a disclaimer about what an abomination that is), it would still tell you how. In other words, most of the knowledge, quirks, and capabilities are preserved.
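The card does not state how this figure was measured; as a rough, hypothetical sketch (assuming both checkpoints load with transformers and fit in memory), the mean per-token KL divergence between the two models on a prompt could be estimated like this:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "SicariusSicariiStuff/Impish_Bloodmoon_12B"
abliterated_id = "SicariusSicariiStuff/Impish_Bloodmoon_12B_Abliterated"

tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
abl = AutoModelForCausalLM.from_pretrained(abliterated_id, torch_dtype=torch.bfloat16)

ids = tok("Pineapple pizza is", return_tensors="pt").input_ids
with torch.no_grad():
    p = F.log_softmax(base(ids).logits, dim=-1)  # original model's next-token log-probs
    q = F.log_softmax(abl(ids).logits, dim=-1)   # abliterated model's next-token log-probs

# KL(P || Q) per position, averaged over the sequence; lower means the "world models" agree more.
kl = F.kl_div(q, p, log_target=True, reduction="none").sum(-1).mean()
print(f"mean per-token KL divergence: {kl.item():.4f}")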


Technical Specs

  • Base Model: Impish_Bloodmoon_12B
  • Parameters: 12B
  • Context Length: 128K tokens
  • Architecture: Mistral (decoder-only transformer)
  • Precision: bf16
  • Method: Orthogonalization-based abliteration
  • License: apache-2.0

Methodology

  1. Identifies refusal direction vectors in activation space
  2. Orthogonalizes weights to inhibit activation along these directions
  3. Preserves (mostly) all other model behaviors and knowledge
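As a loose illustration of step 2 (a sketch that assumes a unit-norm refusal direction has already been extracted from activation differences; it is not the author's exact procedure), projecting that direction out of a weight matrix looks like this:

import torch

def abliterate_weight(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    # W: (d_out, d_in) weight whose outputs live in the same space as refusal_dir (d_out,).
    r = refusal_dir / refusal_dir.norm()       # unit refusal direction
    # W' = (I - r r^T) W : the layer can no longer write along r.
    return W - torch.outer(r, r @ W)

# Hypothetical usage on a single projection matrix:
W = torch.randn(4096, 4096)
r = torch.randn(4096)
W_abl = abliterate_weight(W, r)
print(((r / r.norm()) @ W_abl).abs().max())    # ~0: no output component along the refusal direction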

Model Details

  • Intended use: General Tasks, Roleplay.

  • Censorship level: **Very Low**

  • X / 10 (10 completely uncensored)

UGI score:

pending evals


Citation Information

@llm{Impish_Bloodmoon_12B_Abliterated,
  author = {SicariusSicariiStuff},
  title = {Impish_Bloodmoon_12B_Abliterated},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B_Abliterated}
}

Other stuff

Author: SicariusSicariiStuff

Likes: 3

Downloads: 0

Tags: safetensors, mistral, en, base_model:SicariusSicariiStuff/Impish_Bloodmoon_12B, base_model:finetune:SicariusSicariiStuff/Impish_Bloodmoon_12B, license:apache-2.0, region:us

attilasir/AttilaAI-Q4_K_M-GGUF


base_model: attilasir/AttilaAI
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - gemma3n
  - llama-cpp
  - gguf-my-repo
license: apache-2.0
language:
  - en

attilasir/AttilaAI-Q4_K_M-GGUF

This model was converted to GGUF format from attilasir/AttilaAI using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo attilasir/AttilaAI-Q4_K_M-GGUF --hf-file attilaai-q4_k_m.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo attilasir/AttilaAI-Q4_K_M-GGUF --hf-file attilaai-q4_k_m.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag, along with any other hardware-specific flags (e.g., LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo attilasir/AttilaAI-Q4_K_M-GGUF --hf-file attilaai-q4_k_m.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo attilasir/AttilaAI-Q4_K_M-GGUF --hf-file attilaai-q4_k_m.gguf -c 2048
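Alternatively (an assumption, not covered by the original card), the same quantized file can be used from Python through the llama-cpp-python bindings:

# Minimal sketch using llama-cpp-python; downloads the GGUF from the Hub and runs a completion.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="attilasir/AttilaAI-Q4_K_M-GGUF",
    filename="attilaai-q4_k_m.gguf",
    n_ctx=2048,  # same context size as the llama-server example above
)

out = llm("The meaning to life and the universe is", max_tokens=64)
print(out["choices"][0]["text"])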

Author: attilasir

Likes: 2

Downloads: 0

Tags: transformers, gguf, text-generation-inference, unsloth, gemma3n, llama-cpp, gguf-my-repo, en, base_model:attilasir/AttilaAI, base_model:quantized:attilasir/AttilaAI, license:apache-2.0, endpoints_compatible, region:us, conversational

attilasir/AttilaAI


base_model: unsloth/gemma-3n-e4b-it-unsloth-bnb-4bit
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - gemma3n
license: apache-2.0
language:
  - en

Uploaded finetuned model

  • Developed by: attilasir
  • License: apache-2.0
  • Finetuned from model: unsloth/gemma-3n-e4b-it-unsloth-bnb-4bit

This gemma3n model was trained 2x faster with Unsloth and Hugging Face's TRL library.


Author: attilasir

Likes: 2

Downloads: 0

Tags: transformers, safetensors, gguf, gemma3n, image-to-text, text-generation-inference, unsloth, en, license:apache-2.0, endpoints_compatible, region:us, conversational

Just999999/ContextModel


license: apache-2.0

Author: Just999999

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

Just999999/PromptBase


license: apache-2.0

Author: Just999999

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

Just999999/TextFlow


license: apache-2.0

Author: Just999999

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

ysn-rfd/Spectral-Basis-Adapter


license: apache-2.0


Spectral Basis Adapter (SBA): Dynamic Efficient Fine-Tuning for Diffusion Models

Author: YSNRFD
Project Page: GitHub | HuggingFace | Civitai


Abstract

Fine-tuning large diffusion models like Stable Diffusion XL (SDXL) typically requires substantial computational resources. While Low-Rank Adaptation (LoRA) has become the standard for efficient fine-tuning, it relies on static weight updates. The Spectral Basis Adapter (SBA) introduces a novel approach: a dynamic, LoRA-inspired mechanism that replaces static adaptations with a learnable mixture of orthogonal basis matrices.

This article details the architecture, implementation, and practical application of SBA. We explore how it enables conditional adaptation using timestep and context embeddings while maintaining a low parameter footprint and minimal VRAM usage (under 11GB for SDXL training).


1. Introduction

The rapid evolution of generative AI has led to increasingly large U-Net architectures. Training these models from scratch is prohibitive for most researchers, leading to the popularity of Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA. However, standard LoRA applies a fixed update $\Delta W$ regardless of the input conditions.

SBA addresses this by making the adaptation dynamic. Instead of a single static low-rank update, SBA utilizes a "Basis Bank" of orthogonal matrices. A lightweight gating mechanism combines these bases dynamically based on the current timestep ($t$) and text conditioning ($c$). This allows the model to adapt its behavior specifically to the semantic and temporal context of the generation step.

Key Features

  • Dynamic Adaptation: Weights change based on $t$ and $c$ embeddings via a learned gating mechanism.
  • Orthogonal Basis Bank: Uses a mixture of identity and random orthogonal matrices to preserve manifold geometry.
  • VRAM Efficient: Optimizations in the gate architecture reduce optimizer state overhead, enabling training on consumer GPUs (e.g., 16GB VRAM).
  • Seamless Integration: Injects directly into Hugging Face Diffusers UNet models without altering the base model weights.

2. Theoretical Foundation

The core operation of SBA can be defined by the following equation:

$$y = B \cdot \text{SiLU}( M(t, c) \cdot A \cdot x )$$

Where:

  • $x$: Input tensor to the linear layer.
  • $A$: Down-projection matrix (reduces dimension to rank).
  • $B$: Up-projection matrix (projects back to output dimension).
  • $M(t, c)$: The Mixing Matrix. This is the heart of SBA. It is computed dynamically for every forward pass.
  • $y$: The residual output added to the original layer's output.

The Mixing Matrix $M(t, c)$

The mixing matrix is not a static parameter but a function of the timestep embedding ($t_{emb}$) and the context embedding ($c_{emb}$).

$$M(t, c) = \sum_{i=0}^{N-1} \alpha_i(t, c) \cdot \text{Basis}_i$$

  • $\text{Basis}_i$: A set of $N$ orthogonal matrices of size $\text{rank} \times \text{rank}$.
  • $\alpha_i(t, c)$: Scalar coefficients computed by a gating network (Softmax output) that determines how much each basis contributes to the final transformation.

This formulation allows SBA to switch between different behaviors encoded in the bases, depending on whether the model is denoising high-frequency noise (early timesteps) or refining details (late timesteps).
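To make the formulation concrete, here is a tensor-level sketch of the dynamic mixing; the embedding sizes, gate input, and initialization are illustrative assumptions rather than the repository's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

rank, num_bases, d_in, d_out = 16, 4, 640, 640

A = torch.randn(rank, d_in) * 0.01   # down-projection
B = torch.zeros(d_out, rank)         # up-projection (zero-init so the adapter starts as a no-op)
# Basis 0 is the identity; the rest come from QR decompositions of random Gaussian matrices.
bases = torch.stack([torch.eye(rank)] +
                    [torch.linalg.qr(torch.randn(rank, rank)).Q for _ in range(num_bases - 1)])
gate = nn.Linear(1280 + 1280, num_bases)   # single linear gate over [t_emb, c_emb]

def sba_delta(x, t_emb, c_emb):
    alpha = F.softmax(gate(torch.cat([t_emb, c_emb], dim=-1)), dim=-1)   # (batch, num_bases)
    M = torch.einsum("bn,nij->bij", alpha, bases)                        # M(t, c) = sum_i alpha_i * Basis_i
    h = torch.einsum("bij,bsj->bsi", M, x @ A.T)                         # M(t, c) . A . x
    return F.silu(h) @ B.T                                               # B . SiLU(...)

x = torch.randn(2, 77, d_in)                       # (batch, tokens, features)
t_emb, c_emb = torch.randn(2, 1280), torch.randn(2, 1280)
print(sba_delta(x, t_emb, c_emb).shape)            # torch.Size([2, 77, 640])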


3. Architecture & Implementation

The implementation is divided into three main components: the SpectralBasisAdapter, the LinearWithSBA wrapper, and the SBA Injector.

3.1 SpectralBasisAdapter (sba.py)

This module defines the core logic.

1. Parameter Initialization

The adapter consists of three learnable parameter groups:

  • lora_A: Rank-reduction projection.
  • lora_B: Rank-restoration projection.
  • basis_bank: A 3D tensor of shape (num_bases, rank, rank).

2. The Orthogonal Basis Bank

To ensure training stability and preserve the information manifold, the basis matrices are initialized to be orthogonal.

  • Basis 0: Initialized as an Identity matrix ($I$). This ensures that at $t=0$, the adapter acts roughly like a standard LoRA.
  • Basis 1..N: Initialized via QR decomposition of random Gaussian matrices.

3. The Memory-Efficient Gate

A critical optimization found in the code is the simplification of the gating network.

  • Previous approach: A multi-layer perceptron (MLP).
  • Current approach: A single nn.Linear layer.
  • Reasoning: In AdamW optimization, optimizer states (momentum and variance) can take up to 8 bytes per parameter. By reducing the gate parameters from ~400k to ~10k per layer, the VRAM usage for optimizer states drops from hundreds of megabytes to single digits, significantly reducing the overall memory footprint (Total optimizer states reduced from ~240M to ~6M params).

3.2 LinearWithSBA (sba.py)

This is a wrapper module that replaces standard nn.Linear layers in the UNet.

  • It freezes the original layer weights (requires_grad=False).
  • It initializes the SpectralBasisAdapter.
  • During the forward pass, it returns Original_Output + SBA_Output.
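A minimal version of such a wrapper might look like the sketch below; it assumes the SpectralBasisAdapter class from sba.py and, for clarity, passes the embeddings as forward arguments instead of reading them from the global _SBA_CONTEXT used by the injector:

import torch.nn as nn

class LinearWithSBA(nn.Module):
    # Wraps a frozen nn.Linear and adds the SBA residual on top of its output.
    def __init__(self, base_linear: nn.Linear, rank: int = 16, num_bases: int = 4):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze original weights
        self.adapter = SpectralBasisAdapter(            # assumed class from sba.py
            base_linear.in_features, base_linear.out_features, rank, num_bases)

    def forward(self, x, t_emb, c_emb):
        return self.base(x) + self.adapter(x, t_emb, c_emb)   # Original_Output + SBA_Output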

3.3 SBA Injector (sba_injector.py)

The injector is responsible for recursively patching the UNet architecture.

Mechanism:

  1. Global Context: It defines a global storage _SBA_CONTEXT to pass timestep and context embeddings ($t_{emb}$, $c_{emb}$) to all layers without altering the function signatures of every underlying method.
  2. Monkey Patching: It overrides UNet2DConditionModel.forward to intercept added_cond_kwargs. This allows users to pass SBA embeddings using standard Diffusers arguments.
  3. Recursive Traversal: It iterates through Down, Mid, and Up blocks of the UNet.
  4. Targeted Injection:
    • Transformers: Injects into QKV projections (to_q, to_k, to_v) and output projections (to_out).
    • FFN (Optional): Can inject into FeedForward networks (GEGLU layers).
    • ResNet (Optional): Can inject into ResNet time embeddings.

VRAM Controls: The injector includes flags to skip ResNet (inject_into_resnet=False) and FFN injection. This is crucial for fitting SDXL training into limited VRAM.
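Stripped of the global-context plumbing (and ignoring that to_out is a ModuleList in real Diffusers attention blocks), the recursive replacement step can be sketched as:

import torch.nn as nn

ATTN_TARGETS = ("to_q", "to_k", "to_v", "to_out")

def inject_sba(module: nn.Module, rank: int = 16, num_bases: int = 4):
    # Recursively replace targeted nn.Linear children with LinearWithSBA wrappers.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and any(t in name for t in ATTN_TARGETS):
            setattr(module, name, LinearWithSBA(child, rank=rank, num_bases=num_bases))
        else:
            inject_sba(child, rank=rank, num_bases=num_bases)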


4. Training Workflow

The training script (train_sba.py) demonstrates a standard PyTorch loop integrated with Hugging Face Diffusers.

Step 1: Model Loading & Injection

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

inject_sba_into_diffusers_unet(
    unet, 
    rank=16,              # Rank for low-rank adaptation
    num_bases=4,          # Number of spectral bases
    inject_into_ffn=False, 
    inject_into_resnet=False # Disabled to save VRAM
)

Step 2: Optimizer Configuration

The optimizer is configured to treat gate parameters differently from basis/LoRA parameters. Typically, gates require a higher learning rate to converge faster.

optimizer = configure_sba_optimizer(unet, lr=1e-4, lr_gate=5e-4)
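Conceptually (a sketch, not the repository's exact implementation, and assuming gate parameters are identifiable by name), the helper builds two AdamW parameter groups:

import torch

def configure_sba_optimizer(unet, lr=1e-4, lr_gate=5e-4):
    gate_params, other_params = [], []
    for name, p in unet.named_parameters():
        if not p.requires_grad:
            continue                                   # frozen base UNet weights are skipped
        (gate_params if "gate" in name else other_params).append(p)
    return torch.optim.AdamW([
        {"params": other_params, "lr": lr},            # lora_A / lora_B / basis_bank
        {"params": gate_params, "lr": lr_gate},        # gates converge faster with a higher LR
    ])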

Step 3: The Forward Pass

The forward pass utilizes the UNet's added_cond_kwargs to smuggle the SBA embeddings into the global context.

# Prepare embeddings
added_cond_kwargs = {
    "text_embeds": pooled_text_embeds,
    "time_ids": added_time_ids,
    "sba_t_emb": time_emb,       # Specific to SBA
    "sba_c_emb": pooled_text_embeds # Specific to SBA
}

# Mixed precision forward pass
with torch.autocast(device_type="cuda", dtype=torch.float16):
    noise_pred = unet(dummy_latents, dummy_timesteps, 
                      encoder_hidden_states=encoder_hidden_states, 
                      added_cond_kwargs=added_cond_kwargs)

Step 4: Gradient Checkpointing

To further reduce memory, gradient checkpointing is enabled:

unet.enable_gradient_checkpointing()

This trades compute for memory by recalculating activations during the backward pass rather than storing them.


5. Performance & Diagnostics

Based on the execution logs provided, we can analyze the performance of the SBA integration on SDXL.

Configuration:

  • Model: Stable Diffusion XL Base 1.0
  • Rank: 16
  • Injection: Transformer blocks only (ResNet/FFN skipped)
  • Hardware: CUDA GPU (Mixed Precision FP16)

Results:

  • Trainable Parameters: ~30.5 Million.
    • Note: This includes only the SBA parameters. The base UNet (approx 2.6B params) remains frozen.
  • VRAM Allocation: ~10.05 GB.
    • This is highly efficient for a full-architecture modification of SDXL, fitting comfortably within consumer 12GB-16GB cards.
  • Graph Integrity: Successful backward pass (ConvolutionBackward0 object), confirming that the autograd graph flows correctly through the dynamic mixing matrix.

6. Conclusion

The Spectral Basis Adapter (SBA) represents a significant step forward in parameter-efficient fine-tuning for diffusion models. By moving from static weight updates to dynamic, condition-dependent mixtures of orthogonal bases, SBA offers a more expressive adaptation mechanism.

The implementation provided demonstrates that this expressivity does not come at the cost of usability or memory. Through clever optimizations like the simplified gate and optional injection targets, SBA makes advanced, dynamic fine-tuning of massive models like SDXL accessible on standard hardware.

Recommended Environment:

  • PyTorch >= 2.1
  • diffusers >= 0.26
  • CUDA-enabled GPU

For collaboration, questions, or access to the latest codebases, please refer to the author's profiles on GitHub or HuggingFace.

Author: ysn-rfd

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

kraioi/Nano-Banana-Pro-Unlimited-AI-Video-Generation

🍌 Nano Banana Pro Video Gen Unlimited

Unlimited AI Video Generation | No API Keys | 100% Free

The most accessible AI video engine for creators. Generate viral content for YouTube Shorts, TikTok, and Reels without the monthly subscriptions or API headaches.



🚀 The Nano Banana Edge

Most "AI Video" tools are built behind paywalls or require complex API configurations from OpenAI, Pexels, and ElevenLabs. Nano Banana Pro breaks those barriers by offering a "plug-and-play" experience.

  • ✨ Zero API Keys: No need to hunt for keys or link your credit card.
  • ♾️ Infinite Calls: No daily limits. Generate 100 videos a day if you want.
  • 🎙️ Pro Voiceovers: High-fidelity neural TTS included for free.
  • 🎥 Auto-B-Roll: Intelligent scene selection from a massive internal library.
  • 💥 Viral Styling: Built-in "Alex Hormozi" style captions and transitions.

🛠️ Features at a Glance

| Feature | Nano Banana Pro | Generic AI Tools |
| :--- | :--- | :--- |
| API Costs | $0.00 | $20+ per month |
| Usage Limit | Unlimited | Tiered / Credits |
| Setup Time | < 2 Minutes | 15+ Minutes |
| Captions | Auto-Generated | Manual / Paid |
| Orientation | 9:16 & 16:9 | Restricted |


📥 Installation

Ready to start creating? Nano Banana Pro is designed to be lightweight and fast.

1. Prerequisites

Ensure you have Python 3.9+ and FFmpeg installed.

2. Setup

# Clone the repository

# Navigate to the folder
cd Nano-Banana-Pro-Video-Gen-Unlimited-FREE

# Install the engine
pip install -r requirements.txt

🎬 1. How to Use

Nano Banana Pro is designed for maximum simplicity. You don't need to configure complex JSON files or link API accounts.

See it in Action

Before you run the code, the original README embeds an example clip of what the AI generates in real time from a sample prompt.

The Basic Command

Open your terminal in the project folder and run:

python app.py "Your video topic here"

Advanced Flags

Want more control? Use flags to override default settings on the fly:

  • Change Orientation:
    python app.py "Space travel" --mode landscape
  • Pick a Specific Voice:
    python app.py "Cooking tips" --voice male_energetic
  • Short vs Long:
    python app.py "History of Rome" --duration 60

Batch Processing

To generate multiple videos at once, create a topics.txt file and run:

python app.py --file topics.txt

🎨 2. Aesthetic Customization

Personalize your videos to match your brand. All settings can be found in the .env file or the config.py file.

Visual Styles (Captions & Overlays)

| Setting | Options | Description |
| :--- | :--- | :--- |
| CAPTION_FONT | TheBoldFont, Impact, Arial | High-retention social media fonts. |
| CAPTION_COLOR | Yellow, White, Cyan, Magenta | The primary color of the text. |
| STROKE_WIDTH | 1 to 5 | Adds a black outline for better readability. |
| ANIMATION | Pop, Fade, Karaoke | How the text appears on screen. |

Audio & Narrative Vibe

| Setting | Tone | Best Used For |
| :--- | :--- | :--- |
| VOICE_HD_1 | Deep / Authoritative | Documentaries, Scary Stories. |
| VOICE_HD_2 | High-Energy / Fast | Top 10 Lists, Viral Facts. |
| VOICE_HD_3 | Soft / Friendly | Tutorials, Life Hacks, Poetry. |
| BG_MUSIC | LoFi, Cinematic, None | Automatically mixes background tracks. |
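For example, a .env tuned for high-energy Shorts might combine the settings above like this (key names follow the tables; the exact keys read by config.py may differ):

CAPTION_FONT=Impact
CAPTION_COLOR=Yellow
STROKE_WIDTH=3
ANIMATION=Pop
BG_MUSIC=LoFi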


🎨 Aesthetic Customization

You can tweak the look and feel of your videos directly in the command line or the .env file:

  • Caption Styles: Bold, Impact, Minimalist, Neon
  • Video Format: Portrait (Shorts) or Landscape (YouTube)
  • Voice Mood: Energetic, Calm, Deep, Professional

Author: kraioi

Likes: 1

Downloads: 0

Tags: region:us

mradermacher/LandAI-Base-GGUF


base_model: zhou777/LandAI-Base
language:
  - en
  - zh
library_name: transformers
license: apache-2.0
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  - remote-sensing
  - geospatial-reasoning
  - qwen2.5-vl
  - sft
  - chain-of-thought
  - ms-swift

About


static quants of https://huggingface.co/zhou777/LandAI-Base


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants are available at https://huggingface.co/mradermacher/LandAI-Base-i1-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | mmproj-Q8_0 | 1.0 | multi-modal supplement |
| GGUF | mmproj-f16 | 1.5 | multi-modal supplement |
| GGUF | Q2_K | 3.1 | |
| GGUF | Q3_K_S | 3.6 | |
| GGUF | Q3_K_M | 3.9 | lower quality |
| GGUF | Q3_K_L | 4.2 | |
| GGUF | IQ4_XS | 4.4 | |
| GGUF | Q4_K_S | 4.6 | fast, recommended |
| GGUF | Q4_K_M | 4.8 | fast, recommended |
| GGUF | Q5_K_S | 5.4 | |
| GGUF | Q5_K_M | 5.5 | |
| GGUF | Q6_K | 6.4 | very good quality |
| GGUF | Q8_0 | 8.2 | fast, best quality |
| GGUF | f16 | 15.3 | 16 bpw, overkill |

ikawrakow provides a handy graph comparing some lower-quality quant types (lower is better); the image is embedded in the original model card.

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 1

Downloads: 0

Tags: transformers, gguf, remote-sensing, geospatial-reasoning, qwen2.5-vl, sft, chain-of-thought, ms-swift, en, zh, base_model:zhou777/LandAI-Base, base_model:quantized:zhou777/LandAI-Base, license:apache-2.0, endpoints_compatible, region:us, conversational

rikunarita/Qwen3-4B-Instruct-2507-Genius


base_model:
  - SamsungSAILMontreal/Qwen3-4B-Instruct-2507-Math
  - Qwen/Qwen3-4B-Instruct-2507
  - credshields/Solidity-CodeGen-v0.1
library_name: transformers
tags:
  - mergekit
  - merge
Qwen3-4B-Instruct-2507-Genius

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the Model Stock merge method using Qwen/Qwen3-4B-Instruct-2507 as a base.

Models Merged

The following models were included in the merge:

  • SamsungSAILMontreal/Qwen3-4B-Instruct-2507-Math
  • credshields/Solidity-CodeGen-v0.1

Configuration

The following YAML configuration was used to produce this model:

base_model: Qwen/Qwen3-4B-Instruct-2507
dtype: float16
merge_method: model_stock
slices:
- sources:
  - model: Qwen/Qwen3-4B-Instruct-2507
    layer_range: &id001
    - 0
    - 36
    parameters:
      weight: 0.333
      epsilon: 0.05
  - model: credshields/Solidity-CodeGen-v0.1
    layer_range: *id001
    parameters:
      weight: 0.333
      epsilon: 0.05
  - model: SamsungSAILMontreal/Qwen3-4B-Instruct-2507-Math
    layer_range: *id001
    parameters:
      weight: 0.333
      epsilon: 0.05
parameters:
  normalize: true
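Saving this configuration to a file and passing it to mergekit's command-line entry point reproduces a merge of this kind (the output path and flags here are illustrative):

mergekit-yaml genius.yml ./Qwen3-4B-Instruct-2507-Genius --cuda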

Author: rikunarita

Likes: 1

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, mergekit, merge, conversational, arxiv:2403.19522, base_model:Qwen/Qwen3-4B-Instruct-2507, base_model:merge:Qwen/Qwen3-4B-Instruct-2507, base_model:SamsungSAILMontreal/Qwen3-4B-Instruct-2507-Math, base_model:merge:SamsungSAILMontreal/Qwen3-4B-Instruct-2507-Math, base_model:credshields/Solidity-CodeGen-v0.1, base_model:merge:credshields/Solidity-CodeGen-v0.1, text-generation-inference, endpoints_compatible, region:us