<div style="text-align: center; margin: 1em 0 0.5em 0;">
<h1 style="font-size: 2.4em; font-weight: 800; background: linear-gradient(135deg, #a78bfa 0%, #e879f9 50%, #fb7185 100%); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; margin: 0; padding: 0; letter-spacing: -0.02em;">Anima 2B with Qwen 3.5 4B</h1>
</div>
<style>
.gallery-img {
width: 100%;
border-radius: 8px;
transition: transform 0.3s ease, box-shadow 0.3s ease;
cursor: pointer;
display: block;
}
.gallery-img:hover {
transform: scale(1.02);
box-shadow: 0 4px 20px rgba(0,0,0,0.4);
z-index: 10;
position: relative;
}
</style>
<div style="margin: 1em auto 1.5em auto;">
<a href="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/BNG4ofA1c_AETKR6SGKg1.png" target="_blank"><img class="gallery-img" src="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/BNG4ofA1c_AETKR6SGKg1.png" /></a>
<a href="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/AYws3GpLN7ZDanFksy4Nr.png" target="_blank"><img class="gallery-img" src="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/AYws3GpLN7ZDanFksy4Nr.png" /></a>
<a href="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/mw8qguZ7mXd-8ZSBDeAZ9.png" target="_blank"><img class="gallery-img" src="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/mw8qguZ7mXd-8ZSBDeAZ9.png" /></a>
<a href="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/JRngXFKDlr4KuBlb__M3q.png" target="_blank"><img class="gallery-img" src="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/JRngXFKDlr4KuBlb__M3q.png" /></a>
<a href="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/Wq3iqEYRZPpLajRThRL0X.png" target="_blank"><img class="gallery-img" src="https://cdn-uploads.huggingface.co/production/uploads/63820cd163e3fab40c8b41ca/Wq3iqEYRZPpLajRThRL0X.png" /></a>
</div>
## Table of Contents
- The Problem
- Understanding the Architecture
- The Scaling Problem: 4B vs 0.6B
- Discovery: The ExpRMSNorm Breakthrough
- Procrustes Alignment — Rotating One Brain to Match Another
- Per-Dimension Affine Calibration
- Recommended Settings for Users
- The Mamba2 SSM Rewrite
- Tokenizer: Why Qwen3 ≠ Qwen3.5
- Timeline & Iteration History
## The Problem
Anima 2B ships with a Qwen 3 0.6B text encoder — a small, standard transformer. The model works fine, but 0.6B parameters is a significant bottleneck for understanding complex prompts. nightknocker released a Qwen 3.5 4B hybrid encoder trained for the same ecosystem, promising better text comprehension.
The catch: you can't just swap one text encoder for another. The Anima diffusion model's LLM adapter was trained against the 0.6B's specific embedding distribution. Even though both encoders output 1024-dimensional vectors, they speak completely different "languages" — different magnitude scales, different directions for the same concepts, different statistical distributions.
Our initial naive implementation loaded correctly (all 426/426 weight tensors, 4.14B parameters, no errors), produced valid embeddings with no NaN/Inf... and generated images that were consistently worse than the tiny 0.6B.
This document explains every problem we encountered and how we solved each one.
## Understanding the Architecture
Qwen 3.5 4B is not a standard transformer. It's a hybrid model alternating between two fundamentally different sequence processing mechanisms:
| Property | Value |
|---|---|
| Total layers | 32 |
| SSM (Mamba2) layers | 24 (positions 0,1,2, 4,5,6, ..., 28,29,30) |
| Self-Attention layers | 8 (positions 3, 7, 11, 15, 19, 23, 27, 31) |
| Hidden size | 2560 |
| Output dimension | 1024 (after projection) |
| Vocabulary | 248,320 tokens |
| Weight format | FP8 (F8_E4M3) with BF16 norms |
The pattern is simple: every 4th layer is self-attention, the other three are SSM blocks. The final layer (31) is attention-only with no MLP. This hybrid design gives the model the long-range memory of state space models with periodic full-attention "checkpoints" for global context.
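The interleaving is easy to sanity-check in a couple of lines (a quick sketch; layer indices match the table above):

```python
# Every 4th layer (index % 4 == 3) is self-attention; the other three are Mamba2 SSM
layer_types = ["attn" if i % 4 == 3 else "ssm" for i in range(32)]

print(layer_types.count("ssm"))                               # 24
print([i for i, t in enumerate(layer_types) if t == "attn"])  # [3, 7, 11, 15, 19, 23, 27, 31]
```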
**Output pipeline:** The raw 2560-dim hidden states go through a learned projection:

```
Linear(2560 → 1024) → ExpRMSNorm(1024) → SiLU → Linear(1024 → 1024)
```
This maps the model's internal representation into the 1024-dim space that the Anima adapter expects.
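As a rough sketch of this pipeline's shapes (NumPy with random stand-in weights; the function and weight names are illustrative, and biases are omitted):

```python
import numpy as np

def exp_rmsnorm(x, weight, eps=1e-6):
    # RMS-normalize over the last axis, then scale each dim by exp(weight)
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return np.exp(weight) * (x / rms)

def silu(x):
    return x / (1.0 + np.exp(-x))

def output_pipeline(hidden, w_in, norm_w, w_out):
    """hidden: (seq, 2560) -> (seq, 1024). Weight names are illustrative."""
    x = hidden @ w_in            # Linear(2560 -> 1024)
    x = exp_rmsnorm(x, norm_w)   # ExpRMSNorm(1024)
    x = silu(x)                  # SiLU
    return x @ w_out             # Linear(1024 -> 1024)

rng = np.random.default_rng(0)
h = rng.standard_normal((7, 2560))
out = output_pipeline(h,
                      rng.standard_normal((2560, 1024)) * 0.02,
                      np.full(1024, -0.003),   # near-zero late-norm weights
                      rng.standard_normal((1024, 1024)) * 0.02)
print(out.shape)  # (7, 1024)
```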
## The Scaling Problem: 4B vs 0.6B
Here's what the raw output distributions look like side by side:
| Metric | 0.6B (Original) | 4B (Raw) | Ratio |
|--------|:---:|:---:|:---:|
| Global mean | -0.068 | 0.0015 | ~45× difference |
| Global std | 3.36 | 0.33 | ~10× smaller |
| L2 norm / token | 106.6 | 10.5 | ~10× smaller |
The 4B encoder's outputs are roughly 10× smaller in magnitude than what the Anima adapter expects. Imagine whispering instructions to someone who's used to being shouted at — the signal is there, but it's far too quiet to drive the diffusion process effectively.
This isn't a bug — it's a consequence of two models with different architectures, different training procedures, and different internal normalizations producing embeddings at different scales. The 0.6B was the encoder that the adapter was trained against, so its scale IS the expected scale.
## Discovery: The ExpRMSNorm Breakthrough
Before we could even think about alignment, we had to fix a fundamental error in how we interpreted the model's normalization layer.
### The Mystery of the Near-Zero Weights
All 64 internal RMSNorm layers in the model have learned weights with sensible values, spanning roughly 0.04 to 1.11. These are normal scaling factors: the model learns to emphasize some dimensions and suppress others.
But the late normalization layer (the one in the output projection) had weights centered around -0.003. Nearly zero.
With standard RMSNorm, those weights multiply the normalized output directly:

```
output = weight * (x / RMS(x))
```
If weight ≈ -0.003, you're scaling everything down to essentially nothing. And that's exactly what happened:
| Metric | Standard RMSNorm (broken) | ExpRMSNorm (fixed) |
|--------|:---:|:---:|
| Output std | 0.018 | 0.324 (18× larger) |
| L2 / token | 0.58 | 10.37 (18× larger) |
| Token diversity | 0.003 | 0.821 (274× larger!) |
| Cross-prompt similarity | 0.999 (everything identical) | 0.689 (distinguishable) |
Token diversity of 0.003 means every single token in every single prompt was being mapped to essentially the same vector. The model's understanding was being completely destroyed at the output gate.
### The Fix: exp(weight) Parameterization

The late norm uses exponential weight parameterization:

```
output = exp(weight) * (x / RMS(x))
```
With weight ≈ -0.003:

- Standard: `scale = -0.003` → collapses everything
- Exponential: `scale = exp(-0.003) ≈ 0.997` → near-identity, with tiny learned perturbations
This is the difference between "scale to zero" and "scale to approximately one with fine-grained adjustments." The only reason the late norm's weights are near-zero is because it uses this parameterization — exp(0) = 1 is the neutral point.
This single fix took token diversity from 0.003 to 0.821 — from "completely collapsed" to "rich, distinguishable representations."
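The two readings of the same weights can be contrasted numerically (a NumPy sketch; the weight value -0.003 is the approximate center reported above):

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(1024)
w = np.full(1024, -0.003)                  # late-norm weights, near zero
x_normed = x / np.sqrt(np.mean(x**2) + 1e-6)

std_out = w * x_normed                     # standard RMSNorm reading
exp_out = np.exp(w) * x_normed             # ExpRMSNorm reading

# Effective scale relative to the normalized input
print(round(float(np.abs(std_out).mean() / np.abs(x_normed).mean()), 3))  # 0.003
print(round(float(np.abs(exp_out).mean() / np.abs(x_normed).mean()), 3))  # 0.997
```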
## Procrustes Alignment — Rotating One Brain to Match Another
Even after fixing the ExpRMSNorm, the 4B generated images that didn't follow the prompt well. Why? Because the 4B and 0.6B encode the same concepts in different directions.
Think of it this way: both models understand what "from the side" means, but the 0.6B might encode that as a vector pointing northeast in embedding space, while the 4B encodes it as a vector pointing southwest. The adapter was trained to interpret northeast as a side view — so when it sees southwest, it does something completely wrong.
### What Is Procrustes Alignment?
Procrustes alignment finds the optimal rotation matrix R that maps one embedding space onto another:
$$R^* = \arg\min_{R} \left\| R \cdot X_{4B} - X_{0.6B} \right\|_F \quad \text{subject to} \quad R^T R = I$$
The constraint $R^T R = I$ means R is orthogonal — it's a pure rotation/reflection. No stretching, no squishing. Every distance between embeddings in the 4B's space is perfectly preserved. We're just reorienting the compass.
### How We Computed It
We ran 41,277 prompts through both encoders and collected their mean-pooled 1024-dim embeddings. Then we applied Orthogonal Procrustes (via SVD of the cross-covariance matrix) to find the best rotation.
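The computation itself is a few lines of linear algebra. This sketch uses toy dimensions and the row convention `X @ R`; it recovers a known random rotation to show that the SVD solution is exact:

```python
import numpy as np

def procrustes(X_src, X_tgt):
    """Orthogonal R minimizing ||X_src @ R - X_tgt||_F (rows are prompt embeddings)."""
    M = X_src.T @ X_tgt              # cross-covariance matrix
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                    # orthogonal by construction

# Sanity check: recover a known random rotation (toy 64-dim space for speed)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # "true" rotation
R = procrustes(X, X @ Q)
print(np.allclose(R, Q, atol=1e-6))                  # True
```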
The results:
| | Before Alignment | After Alignment |
|---|:---:|:---:|
| Mean cosine similarity | -0.034 | 0.960 |
| Minimum cosine similarity | -0.115 | 0.766 |
Before alignment, the two encoders had negative average cosine similarity — their concept directions were essentially uncorrelated. After: 0.96 average agreement.
Per-category breakdown:
| Category | Before → After |
|---|---|
| Spatial (viewpoints, poses) | -0.034 → 0.960 |
| Pose | -0.021 → 0.943 |
| Composition | -0.028 → 0.956 |
| Character | -0.028 → 0.943 |
| Environment | -0.025 → 0.954 |
| Meta (quality tags) | -0.034 → 0.838 |
| Multi (complex prompts) | -0.027 → 0.898 |
### Rotation vs. Bias Shift

The alignment has two components:

- **Rotation** — The 1024×1024 orthogonal matrix R that reorients concept directions. This is always applied when alignment is enabled. It fixes what direction concepts point in, without changing magnitude.
- **Bias shift** — Re-centering from the 4B's mean embedding to the 0.6B's mean embedding. The 0.6B's mean has L2 ≈ 70 while the 4B's mean has L2 ≈ 5, so the full shift dramatically changes output magnitude. This is controlled by the `alignment_strength` slider.
The `alignment_strength` parameter (0.0–1.0) only controls the bias shift, not the rotation:

```
x_aligned = R @ (x - mean_4b) + (1 - α) * mean_4b + α * mean_06b
```

- `α = 0.0`: Rotate only, keep the 4B's own magnitude
- `α = 0.5`: Rotate + halfway bias shift (recommended starting point)
- `α = 1.0`: Rotate + full shift to the 0.6B's distribution center
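Putting the formula above into code (a NumPy sketch; `apply_alignment` is an illustrative name, and the identity matrix stands in for the learned rotation):

```python
import numpy as np

def apply_alignment(x, R, mean_4b, mean_06b, alpha=0.5):
    """Always rotate; blend the bias shift by alpha (= alignment_strength)."""
    return R @ (x - mean_4b) + (1.0 - alpha) * mean_4b + alpha * mean_06b

rng = np.random.default_rng(0)
dim = 1024
x = rng.standard_normal(dim)
R = np.eye(dim)                    # identity stand-in for the learned rotation
mean_4b = np.full(dim, 0.05)       # toy means; real ones come from prompt statistics
mean_06b = np.full(dim, -0.07)

# alpha = 0.0 leaves the 4B's own center untouched (rotation only)
print(np.allclose(apply_alignment(x, R, mean_4b, mean_06b, alpha=0.0), x))  # True
```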
## Per-Dimension Affine Calibration
Beyond rotation, the two encoders also differ in their per-dimension scales. Dimension 42 in the 0.6B might have 3× the variance of dimension 42 in the 4B, while dimension 500 might be 0.5×.
The calibration computes a per-dimension affine transform:

```
output_calibrated[d] = scale[d] * output_4b[d] + bias[d]
```

where:

```
scale[d] = std_06b[d] / std_4b[d]
bias[d]  = mean_06b[d] - scale[d] * mean_4b[d]
```
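Fitting that transform from paired embedding statistics might look like this (a NumPy sketch with synthetic stand-in data; `fit_calibration` is an illustrative name):

```python
import numpy as np

def fit_calibration(emb_06b, emb_4b, eps=1e-8):
    """Per-dimension affine transform matching 4B stats to 0.6B stats.
    emb_*: (n_prompts, 1024) mean-pooled embeddings from each encoder."""
    scale = emb_06b.std(axis=0) / (emb_4b.std(axis=0) + eps)
    bias = emb_06b.mean(axis=0) - scale * emb_4b.mean(axis=0)
    return scale, bias

# Synthetic stand-ins with roughly the reported spreads (0.6B wide, 4B narrow)
rng = np.random.default_rng(0)
e06 = rng.standard_normal((30, 1024)) * 3.3 - 0.07
e4 = rng.standard_normal((30, 1024)) * 0.33
scale, bias = fit_calibration(e06, e4)

# After calibration, the 4B embeddings share the 0.6B's per-dimension moments
calibrated = scale * e4 + bias
print(np.allclose(calibrated.std(axis=0), e06.std(axis=0)))  # True
```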
Calibration statistics (from 30 diverse prompts):
| | Value |
|---|---|
| Scale range | 1.03 – 79.7 |
| Scale mean | 5.47 |
| Bias mean | -0.075 |
Most dimensions need ~5× scaling. Some need up to 80×. This makes sense given the 10× overall magnitude difference — it's not uniform across dimensions.
Note: Calibration and alignment serve different purposes. Alignment fixes directions (rotation). Calibration fixes magnitudes (per-dimension scaling). They can be used independently or together.
## Recommended Settings for Users

### Start Simple, Add Complexity

**Step 1: Baseline (no alignment, no calibration)**

```
use_alignment: OFF
use_calibration: OFF
output_scale: 1.0
```
Generate some images with your usual prompts. This gives you the raw 4B output — better text understanding, but the adapter may misinterpret concept directions.
**Step 2: Add alignment at half strength**

```
use_alignment: ON
alignment_strength: 0.5
use_calibration: OFF
output_scale: 1.0
```
This rotates the 4B's concept space to match the 0.6B (fixing spatial/pose understanding) while blending the magnitude halfway between the two encoders. Compare your results — you should see better prompt adherence, especially for viewpoints, poses, and spatial composition.
**Step 3: Experiment with strength**

- If poses/viewpoints still aren't quite right, increase `alignment_strength` toward 1.0
- If image quality or detail seems to degrade at high strength, back off toward 0.3
- The sweet spot varies by prompt type — 0.5 is a good general default
**Step 4 (Optional): Try calibration**

```
use_calibration: ON
```

This applies per-dimension scaling on top of alignment. It can help in some cases but may also over-correct. Test both ways and compare.
### Quick Reference
| Setting | What It Does | When to Use |
|---|---|---|
| alignment OFF | Raw 4B embeddings | Baseline comparison |
| alignment ON, strength 0.0 | Rotation only, 4B magnitude | Fix concept directions without changing scale |
| alignment ON, strength 0.5 | Rotation + half bias shift | Best general starting point |
| alignment ON, strength 1.0 | Full 0.6B-like distribution | Maximum compatibility with adapter |
| calibration ON | Per-dimension affine scaling | Fine-grained magnitude matching |
| output_scale | Uniform multiplier | Last-resort manual adjustment |
## The Mamba2 SSM Rewrite
Qwen 3.5 4B is not a standard transformer you can load with a config swap — 24 of its 32 layers are Mamba2 selective state space blocks, an architecture with no off-the-shelf ComfyUI support. We had to implement the full SSM from scratch.
The approach was to work directly from the reference Mamba2 implementation, mapping every projection, convolution, and recurrence step to the weight shapes we found in the checkpoint. The initial implementation ran without errors but produced garbage embeddings — every tensor shape was valid, no NaN/Inf, just wrong math.
The rewrite came down to carefully matching the reference's data flow: which projections go through the causal conv1d and which bypass it as a gate, the full multi-dimensional state recurrence (not a scalar approximation), input-dependent discretization that makes the SSM selective, and the skip connections that the architecture relies on. Several hundred million parameters that were being loaded but never actually used in the forward pass are now contributing.
The key insight was that SSM bugs are silent — the shapes all work out, gradients would flow if you were training, and the output looks like plausible floating point numbers. The only way to catch them was methodical comparison against the reference code, projection by projection.
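To make the "full multi-dimensional state recurrence" concrete, here is a deliberately minimal diagonal selective-SSM loop in NumPy. It is an illustrative sketch of the mechanism (single head, no causal conv or gating), not the node's actual implementation:

```python
import numpy as np

def selective_ssm(x, A, B, C, D, dt):
    """Minimal diagonal selective-SSM recurrence (illustrative sketch).
    x:  (T, d_in)        input sequence
    A:  (d_in,)          negative per-channel decay rates
    B, C: (T, d_state)   input-dependent projections (the "selective" part)
    dt: (T, d_in)        input-dependent step sizes from discretization
    D:  (d_in,)          learned skip connection"""
    T, d_in = x.shape
    h = np.zeros((d_in, B.shape[1]))                    # full 2-D state, not a scalar
    y = np.empty_like(x)
    for t in range(T):
        dA = np.exp(dt[t][:, None] * A[:, None])        # zero-order-hold decay
        dBx = (dt[t] * x[t])[:, None] * B[t][None, :]   # discretized input write
        h = dA * h + dBx                                # state recurrence
        y[t] = h @ C[t] + D * x[t]                      # readout plus the D skip
    return y

rng = np.random.default_rng(0)
T, d_in, d_state = 6, 4, 8
y = selective_ssm(
    x=rng.standard_normal((T, d_in)),
    A=-np.abs(rng.standard_normal(d_in)),
    B=rng.standard_normal((T, d_state)),
    C=rng.standard_normal((T, d_state)),
    D=rng.standard_normal(d_in),
    dt=np.abs(rng.standard_normal((T, d_in))) + 0.1,
)
print(y.shape)  # (6, 4)
```

Note how every shape "works out" even if `dA`, `dBx`, or the `D` skip were wired wrong — which is exactly why SSM bugs are silent.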
## Tokenizer: Why Qwen3 ≠ Qwen3.5
This was an easy mistake to make — and a critical one to fix.
| | Qwen 3 (0.6B) | Qwen 3.5 (4B) |
|---|:---:|:---:|
| Vocabulary size | 151,936 | 248,320 |
| Extra tokens | — | +96,384 (3 blocks of 32,128) |
The extra 96,384 tokens in Qwen 3.5 correspond exactly to three copies of T5's vocabulary size (3 × 32,128), suggesting the model was designed to bridge between Qwen and T5 embedding spaces.
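The vocabulary arithmetic checks out exactly:

```python
# Qwen 3 vocab plus three T5-sized blocks gives the Qwen 3.5 vocab
qwen3_vocab, t5_vocab = 151_936, 32_128
extra = 3 * t5_vocab
print(extra)                # 96384
print(qwen3_vocab + extra)  # 248320
```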
Using the Qwen 3 tokenizer with the 4B model means:
- Different BPE merge rules produce different token boundaries
- Every token ID potentially maps to the wrong embedding row
- 96,384 trained embedding rows are never accessed
- The model receives garbled input it was never trained on
The node bundles the correct Qwen 3.5 tokenizer (248,320 tokens) and falls back to auto-downloading from `Qwen/Qwen3.5-4B` on Hugging Face if local files aren't found.
## Timeline & Iteration History

### v0.1.0 — Initial Release (2026-03-08)
- Full custom implementation of the hybrid Mamba2/Attention architecture
- Weight loading (426 tensors, 4.14B parameters)
- ComfyUI CLIP-compatible
- Result: Images generated but consistently worse than 0.6B
### v0.2.0 — Mamba2 SSM Rewrite (2026-03-09)
- Fixed 5 critical bugs in the SSM block (conv split, gate, d_state, dt, D skip)
- ~240M previously-ignored parameters now contributing
- Result: Better internal representations, still misaligned with adapter
### v0.3.0 — ExpRMSNorm Discovery (2026-03-09)

- Discovered the late norm uses `exp(weight)` parameterization
- Token diversity went from 0.003 to 0.821 (274× improvement)
- Result: Meaningful, distinguishable embeddings for the first time
### v0.4.0 — Alignment & Calibration (2026-03-09)
- Procrustes alignment over 41K prompts (cosine similarity: -0.03 → 0.96)
- Per-dimension affine calibration from 30 diverse prompts
- Correct Qwen 3.5 tokenizer (vocab=248,320)
- Result: Substantially improved prompt adherence and image quality
This node is open source. Contributions, testing results, and alignment experiments are welcome.