Today's AI Summary

AI Developments: OCR Advancements, FaceCLIP for Image Synthesis, and Reasoning Enhancements

Here's a look at some of the latest developments in AI, focusing on new models and research papers:

Research Highlights

  • StreamingVLM: Real-Time Video Understanding: A new paper introduces StreamingVLM, designed for real-time understanding of infinite video streams. It maintains a compact KV cache and aligns training with streaming inference. The model achieves a 66.18% win rate against GPT-4o mini on a new benchmark, Inf-Streams-Eval, and improves general VQA abilities.
  • Prompting Test-Time Scaling (P-TTS): This paper introduces P-TTS, an inference-time data augmentation strategy for enhancing LLM reasoning. By leveraging a small pool of reasoning instances and varying exemplar augmentation, P-TTS achieves significant accuracy gains on mathematical reasoning tasks, outperforming competitive baselines.
  • GraphMERT: Distilling Knowledge Graphs: GraphMERT, a small graphical encoder model, distills high-quality knowledge graphs from unstructured text. It outperforms large language models in generating reliable domain-specific knowledge graphs, achieving higher FActScore and ValidityScore on PubMed papers related to diabetes.
  • LiveOIBench: Evaluating LLMs in Competitive Programming: A new benchmark, LiveOIBench, features 403 Olympiad-level competitive programming problems. Benchmarking results show that GPT-5 achieves a notable percentile but still falls short of top human contestants, highlighting areas for improvement in reasoning and problem analysis.
  • Dyna-Mind: Learning to Simulate: Dyna-Mind is a two-stage training framework that teaches (V)LM agents to integrate simulation into their reasoning. Experiments on synthetic and realistic benchmarks demonstrate that Dyna-Mind effectively infuses simulation ability into AI agents, leading to better policies for long-horizon tasks.
  • SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models: This paper introduces the Sandwiched Policy Gradient (SPG) method for aligning diffusion language models with human preferences. SPG leverages both upper and lower bounds of the true log-likelihood, outperforming baselines in various tasks.
  • Mitigating Overthinking through Reasoning Shaping: This paper introduces Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning in large reasoning models. GRSP achieves superior token efficiency without heavily compromising accuracy, especially with harder problems.

Model Spotlight

  • Nanonets-OCR2-3B: This model by Nanonets is designed for transforming documents into structured markdown. It features LaTeX equation recognition, intelligent image description, signature detection, watermark extraction, smart checkbox handling, and complex table extraction. It supports multiple languages and includes visual question answering capabilities. The model has received 47 likes.
  • Nanonets-OCR2-1.5B-exp: Another model in the Nanonets-OCR2 family, this version offers similar features to the 3B model but with a smaller size. It has received 19 likes.
  • ByteDance/FaceCLIP: This model focuses on ID-preserving image synthesis. It learns a joint ID-textual representation to generate photorealistic portraits with better identity retention and text alignment. It has received 13 likes.

Key Takeaways

  • OCR Advancements: The Nanonets-OCR2 models demonstrate significant progress in optical character recognition, offering advanced features for document processing and understanding.
  • Image Synthesis: FaceCLIP introduces a novel approach to identity-preserving image generation, showcasing the potential of joint ID-textual representations.
  • Reasoning Enhancements: Research on P-TTS, SPG, and GRSP highlights the importance of data augmentation, policy gradient methods, and reasoning shaping in improving the reasoning capabilities of LLMs.
  • Video Understanding: StreamingVLM addresses the challenge of real-time video understanding, offering a promising solution for processing near-infinite video streams.
  • Knowledge Graphs: GraphMERT demonstrates the effectiveness of small graphical models in distilling high-quality knowledge graphs from unstructured text, providing a reliable and scalable neurosymbolic approach.

AI Papers for 2026-03-11

Scale Space Diffusion

Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our project website ( https://prateksha.github.io/projects/scale-space-diffusion/ ) is available publicly.

Agentic Critical Training

Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.

Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.

A Multi-Objective Optimization Approach for Sustainable AI-Driven Entrepreneurship in Resilient Economies

The rapid advancement of artificial intelligence (AI) technologies presents both unprecedented opportunities and significant challenges for sustainable economic development. While AI offers transformative potential for addressing environmental challenges and enhancing economic resilience, its deployment often involves substantial energy consumption and environmental costs. This research introduces the EcoAI-Resilience framework, a multi-objective optimization approach designed to maximize the sustainability benefits of AI deployment while minimizing environmental costs and enhancing economic resilience. The framework addresses three critical objectives through mathematical optimization: sustainability impact maximization, economic resilience enhancement, and environmental cost minimization. The methodology integrates diverse data sources, including energy consumption metrics, sustainability indicators, economic performance data, and entrepreneurship outcomes across 53 countries and 14 sectors from 2015-2024. Our experimental validation demonstrates exceptional performance with R scores exceeding 0.99 across all model components, significantly outperforming baseline methods, including Linear Regression (R = 0.943), Random Forest (R = 0.957), and Gradient Boosting (R = 0.989). The framework successfully identifies optimal AI deployment strategies featuring 100% renewable energy integration, 80% efficiency improvement targets, and optimal investment levels of $202.48 per capita. Key findings reveal strong correlations between economic complexity and resilience (r = 0.82), renewable energy adoption and sustainability outcomes (r = 0.71), and demonstrate significant temporal improvements in AI readiness (+1.12 points/year) and renewable energy adoption (+0.67 per year) globally.

Split Federated Learning Architectures for High-Accuracy and Low-Delay Model Training

Can we find a network architecture for ML model training so as to optimize training loss (and thus, accuracy) in Split Federated Learning (SFL)? And can this architecture also reduce training delay and communication overhead? While accuracy is not influenced by how we split the model in ordinary, state-of-the-art SFL, in this work we answer the questions above in the affirmative. Recent Hierarchical SFL (HSFL) architectures adopt a three-tier training structure consisting of clients, (local) aggregators, and a central server. In this architecture, the model is partitioned at two partitioning layers into three sub-models, which are executed across the three tiers. Despite their merits, HSFL architectures overlook the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay, and overhead. This work explicitly captures the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay and overhead by formulating a joint optimization problem. We prove that the problem is NP-hard and propose the first accuracy-aware heuristic algorithm that explicitly accounts for model accuracy, while remaining delay-efficient. Simulation results on public datasets show that our approach can improve accuracy by 3%, while reducing delay by 20% and overhead by 50%, compared to state-of-the-art SFL and HSFL schemes.

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
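
The abstract does not spell out Trilobyte's exact schema, but the vocabulary argument can be illustrated with a generic byte-level split (a sketch under that assumption, not the paper's implementation): each b-bit sample becomes b/8 byte tokens, so the vocabulary stays at 256 symbols regardless of bit depth.

def sample_level_vocab(bit_depth: int) -> int:
    # Sample-level tokenization: one token per sample, vocabulary of size 2^b
    return 2 ** bit_depth

def byte_level_tokens(samples: list[int], bit_depth: int) -> list[int]:
    # Byte-level tokenization: split each b-bit sample into b/8 byte tokens,
    # so the vocabulary is 256 symbols regardless of bit depth
    n_bytes = bit_depth // 8
    tokens = []
    for s in samples:
        for i in reversed(range(n_bytes)):      # most-significant byte first
            tokens.append((s >> (8 * i)) & 0xFF)
    return tokens

print(sample_level_vocab(16))             # 65536; at 24-bit this would be 16,777,216
print(byte_level_tokens([0x12ABCD], 24))  # [18, 171, 205], all drawn from a 256-symbol vocabulary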

A New Lower Bound for the Random Offerer Mechanism in Bilateral Trade using AI-Guided Evolutionary Search

The celebrated Myerson--Satterthwaite theorem shows that in bilateral trade, no mechanism can be simultaneously fully efficient, Bayesian incentive compatible (BIC), and budget balanced (BB). This naturally raises the question of how closely the gains from trade (GFT) achievable by a BIC and BB mechanism can approximate the first-best (fully efficient) benchmark. The optimal BIC and BB mechanism is typically complex and highly distribution-dependent, making it difficult to characterize directly. Consequently, much of the literature analyzes simpler mechanisms such as the Random-Offerer (RO) mechanism and establishes constant-factor guarantees relative to the first-best GFT. An important open question concerns the worst-case performance of the RO mechanism relative to first-best (FB) efficiency. While it was originally hypothesized that the approximation ratio $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}}$ is bounded by $2$, recent work provided counterexamples to this conjecture: Cai et al. proved that the ratio can be strictly larger than $2$, and Babaioff et al. exhibited an explicit example with ratio approximately $2.02$. In this work, we employ AlphaEvolve, an AI-guided evolutionary search framework, to explore the space of value distributions. We identify a new worst-case instance that yields an improved lower bound of $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}} \ge \textbf{2.0749}$. This establishes a new lower bound on the worst-case performance of the Random-Offerer mechanism, demonstrating a wider efficiency gap than previously known.

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.

CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.

AI Models

HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive


license: apache-2.0
tags:
  • uncensored
  • qwen3.5
  • moe
  • gguf
  • vision
  • multimodal
language:
  • en
  • zh
  • multilingual
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.5-35B-A3B

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

Qwen3.5-35B-A3B uncensored by HauhauCS. 0/465 refusals.

About

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended - just without the refusals.

These are meant to be the best lossless uncensored models out there.

Aggressive Variant

Stronger uncensoring — model is fully unlocked and won't refuse prompts. May occasionally append short disclaimers (baked into base model training, not refusals) but full content is always generated.

For a more conservative uncensor that keeps some safety guardrails, check the Balanced variant when it's available.

Downloads

| File | Quant | Size |
|------|-------|------|
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-BF16.gguf | BF16 | 65 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf | Q8_0 | 35 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K.gguf | Q6_K | 27 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_M.gguf | Q5_K_M | 24 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | Q4_K_M | 20 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-IQ4_XS.gguf | IQ4_XS | 18 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q3_K_M.gguf | Q3_K_M | 16 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf | IQ3_M | 15 GB |
| Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-IQ2_M.gguf | IQ2_M | 11 GB |
| mmproj-Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf | mmproj (f16) | 858 MB |

All quants generated with importance matrix (imatrix) for optimal quality preservation on abliterated weights.

Specs

  • 35B total parameters, ~3B active per forward pass (MoE)
  • 256 experts, 8 routed + 1 shared per token
  • Hybrid architecture: Gated DeltaNet linear attention + full softmax attention (3:1 ratio)
  • 40 layers, pattern: 10 x (3 x DeltaNet-MoE + 1 x Attention-MoE)
  • 262K native context (extendable to 1M with YaRN)
  • Natively multimodal (text, image, video)
  • Multi-token prediction (MTP) support
  • 248K vocabulary, 201 languages
  • Based on Qwen/Qwen3.5-35B-A3B

Recommended Settings

From the official Qwen authors:

Thinking mode (default):

  • General: temperature=1.0, top_p=0.95, top_k=20, min_p=0, presence_penalty=1.5
  • Coding/precise tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0

Non-thinking mode:

  • General: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5
  • Reasoning tasks: temperature=1.0, top_p=1.0, top_k=40, min_p=0, presence_penalty=2.0

Important:

  • Keep at least 128K context to preserve thinking capabilities
  • Use --jinja flag with llama.cpp for proper chat template handling
  • Vision support requires the mmproj file alongside the main GGUF

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.

# Text only
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --jinja -c 131072 -ngl 99

# With vision
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  --jinja -c 131072 -ngl 99

Note: LM Studio may show 256x2.6B in the params column instead of 35B-A3B — this is a cosmetic metadata quirk; the model runs correctly.

Other Formats

  • GGUF (this repo)
  • GPTQ — coming soon

Other Models

Author: HauhauCS

Likes: 17

Downloads: 0

Tags: gguf, uncensored, qwen3.5, moe, vision, multimodal, image-text-to-text, en, zh, multilingual, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational

tencent/Sequential-Hidden-Decoding-8B-n4


license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
  • Qwen/Qwen3-8B-Base
tags:
  • sequential-hidden-decoding
  • pretrained
  • base-model

Sequential-Hidden-Decoding-8B-n4

This is the n=4 variant of Sequential Hidden Decoding, a method that scales sequence length by n× with only additional Embedding parameters — same Transformer, more compute per token.

  • Base model: Qwen3-8B-Base
  • Scale: 4×
  • Additional Embedding Params: 3.1B
  • Training Tokens: 150B
  • Dtype: bfloat16

Note: This is a base model (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
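
A minimal PyTorch sketch of this interleaving, based only on the description above (sizes and names are illustrative, not the released implementation):

import torch
import torch.nn as nn

n, vocab, d_model = 4, 1000, 64               # tiny illustrative sizes

# n independent embedding tables over the same vocabulary
embeddings = nn.ModuleList([nn.Embedding(vocab, d_model) for _ in range(n)])

def interleave(token_ids: torch.Tensor) -> torch.Tensor:
    # token_ids: (batch, seq_len) -> (batch, seq_len * n, d_model)
    # Order: [e1(t0), ..., en(t0), e1(t1), ..., en(t1), ...]
    stacked = torch.stack([emb(token_ids) for emb in embeddings], dim=2)  # (B, T, n, D)
    return stacked.flatten(1, 2)

token_ids = torch.randint(0, vocab, (2, 16))
hidden_in = interleave(token_ids)             # fed to the unmodified Transformer
# Only every n-th position (the last embedding of each token) receives the next-token
# loss; the preceding n-1 positions act as latent "reasoning" slots.
loss_positions = torch.arange(n - 1, hidden_in.size(1), n)
print(hidden_in.shape, loss_positions[:4])    # torch.Size([2, 64, 64]) tensor([ 3,  7, 11, 15])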

Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Serving (SGLang)

This model requires a patched version of SGLang for inference. See the project page for installation options (Docker image, forked repo, or manual patch).

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n4 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n4",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)

All Models

| Model | Scale | Embedding Params | Training Tokens |
|-------|:-----:|:----------------:|:---------------:|
| Sequential-Hidden-Decoding-8B-n2 | 2× | 1.9B | 75B |
| Sequential-Hidden-Decoding-8B-n4 | 4× | 3.1B | 150B |
| Sequential-Hidden-Decoding-8B-n8 | 8× | 5.6B | 187B |

Citation

@article{hidden_decoding_2026,
  title   = {Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

License

This model is released under the License Terms of Sequential-Hidden-Decoding.

Author: tencent

Likes: 8

Downloads: 0

Tags: safetensors, qwen3_scale_seq, sequential-hidden-decoding, pretrained, base-model, custom_code, base_model:Qwen/Qwen3-8B-Base, base_model:finetune:Qwen/Qwen3-8B-Base, license:other, region:us

tencent/Sequential-Hidden-Decoding-8B-n2


license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
  • Qwen/Qwen3-8B-Base
tags:
  • sequential-hidden-decoding
  • pretrained
  • base-model

Sequential-Hidden-Decoding-8B-n2

This is the n=2 variant of Sequential Hidden Decoding, a method that scales sequence length by n× with only additional Embedding parameters — same Transformer, more compute per token.

  • Base model: Qwen3-8B-Base
  • Scale: 2×
  • Additional Embedding Params: 1.9B
  • Training Tokens: 75B
  • Dtype: bfloat16

Note: This is a base model (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.

Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Serving (SGLang)

This model requires a patched version of SGLang for inference. See the project page for installation options (Docker image, forked repo, or manual patch).

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)

All Models

| Model | Scale | Embedding Params | Training Tokens |
|-------|:-----:|:----------------:|:---------------:|
| Sequential-Hidden-Decoding-8B-n2 | 2× | 1.9B | 75B |
| Sequential-Hidden-Decoding-8B-n4 | 4× | 3.1B | 150B |
| Sequential-Hidden-Decoding-8B-n8 | 8× | 5.6B | 187B |

Citation

@article{hidden_decoding_2026,
  title   = {Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

License

This model is released under the License Terms of Sequential-Hidden-Decoding.

Author: tencent

Likes: 7

Downloads: 0

Tags: safetensors, qwen3_scale_seq, sequential-hidden-decoding, pretrained, base-model, custom_code, base_model:Qwen/Qwen3-8B-Base, base_model:finetune:Qwen/Qwen3-8B-Base, license:other, region:us

tencent/Sequential-Hidden-Decoding-8B-n8


license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
  • Qwen/Qwen3-8B-Base
tags:
  • sequential-hidden-decoding
  • pretrained
  • base-model

Sequential-Hidden-Decoding-8B-n8

This is the n=8 variant of Sequential Hidden Decoding, a method that scales sequence length by n× with only additional Embedding parameters — same Transformer, more compute per token.

  • Base model: Qwen3-8B-Base
  • Scale: 8×
  • Additional Embedding Params: 5.6B
  • Training Tokens: 187B
  • Dtype: bfloat16

Note: This is a base model (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.

Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Serving (SGLang)

This model requires a patched version of SGLang for inference. See the project page for installation options (Docker image, forked repo, or manual patch).

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n8 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n8",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)

All Models

| Model | Scale | Embedding Params | Training Tokens |
|-------|:-----:|:----------------:|:---------------:|
| Sequential-Hidden-Decoding-8B-n2 | 2× | 1.9B | 75B |
| Sequential-Hidden-Decoding-8B-n4 | 4× | 3.1B | 150B |
| Sequential-Hidden-Decoding-8B-n8 | 8× | 5.6B | 187B |

Citation

@article{hidden_decoding_2026,
  title   = {Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

License

This model is released under the License Terms of Sequential-Hidden-Decoding.

Author: tencent

Likes: 6

Downloads: 0

Tags: safetensors, qwen3_scale_seq, sequential-hidden-decoding, pretrained, base-model, custom_code, base_model:Qwen/Qwen3-8B-Base, base_model:finetune:Qwen/Qwen3-8B-Base, license:other, region:us

DreamFast/gemma-3-12b-it-heretic-v2


license: gemma
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-3-12b-it
base_model_relation: quantized
language:
  • en
tags:
  • abliteration
  • heretic
  • uncensored
  • gemma
  • ltx-2
  • comfyui
  • video-generation
  • text-encoder
  • nvfp4
  • blackwell

Gemma 3 12B IT - Heretic v2 (Abliterated)

An abliterated version of Google's Gemma 3 12B IT created using Heretic v1.2.0. This model has reduced refusals while maintaining model quality, making it suitable as an uncensored text encoder for video generation models like LTX-2.

The Docker setup, scripts, and configurations used to produce these files are available in the Heretic Docker GitHub repository.

What's new in v2

  • Heretic v1.2.0 with 200 trials (v1 used v1.1.0 with 100 trials)
  • Better trial selection: Trial 174 — 8/100 refusals at KL 0.0801 (v1: Trial 99, 7/100 refusals at KL 0.0826)
  • Vision preserved: All ComfyUI variants keep vision_model and multi_modal_projector keys for I2V prompt enhancement
  • NVFP4 quantization: ComfyUI-native 4-bit format for Blackwell GPUs (~3x smaller than bf16)
  • Updated GGUF support: ComfyUI-GGUF now has merged Gemma 3 support (PR #402)

Model Details

  • Base Model: google/gemma-3-12b-it
  • Abliteration Method: Heretic v1.2.0
  • Trials: 200
  • Trial Selected: Trial 174
  • Refusals: 8/100 (vs 100/100 original)
  • KL Divergence: 0.0801 (minimal model damage)

Files

HuggingFace Format (for transformers, llama.cpp conversion)

model-00001-of-00005.safetensors
model-00002-of-00005.safetensors
model-00003-of-00005.safetensors
model-00004-of-00005.safetensors
model-00005-of-00005.safetensors
config.json
tokenizer.model
tokenizer.json
tokenizer_config.json

ComfyUI Format (with vision, for LTX-2 T2V and I2V)

comfyui/gemma-3-12b-it-heretic-v2.safetensors              # bf16, 23GB
comfyui/gemma-3-12b-it-heretic-v2_fp8_e4m3fn.safetensors   # fp8, 12GB
comfyui/gemma-3-12b-it-heretic-v2_nvfp4.safetensors        # nvfp4, 7.8GB

All ComfyUI variants include vision (vision_model and multi_modal_projector weights). The vision weights are unused during T2V (text-to-video) and add minimal overhead (~1 GB). For I2V (image-to-video) workflows using TextGenerateLTX2Prompt with an image input, the vision weights are required.

GGUF Format (for llama.cpp and ComfyUI-GGUF)

| Quant | Size | Notes |
|-------|------|-------|
| F16 | 22GB | Lossless reference |
| Q8_0 | 12GB | Excellent quality |
| Q6_K | 9.0GB | Very good quality |
| Q5_K_M | 7.9GB | Good quality |
| Q5_K_S | 7.7GB | Slightly smaller Q5 |
| Q4_K_M | 6.8GB | Recommended balance |
| Q4_K_S | 6.5GB | Smaller Q4 variant |
| Q3_K_M | 5.6GB | For low VRAM only |

gguf/gemma-3-12b-it-heretic-v2-f16.gguf
gguf/gemma-3-12b-it-heretic-v2-Q8_0.gguf
gguf/gemma-3-12b-it-heretic-v2-Q6_K.gguf
gguf/gemma-3-12b-it-heretic-v2-Q5_K_M.gguf
gguf/gemma-3-12b-it-heretic-v2-Q5_K_S.gguf
gguf/gemma-3-12b-it-heretic-v2-Q4_K_M.gguf
gguf/gemma-3-12b-it-heretic-v2-Q4_K_S.gguf
gguf/gemma-3-12b-it-heretic-v2-Q3_K_M.gguf

NVFP4 Notes

The NVFP4 (4-bit floating point, E2M1) variants use ComfyUI's native quantization format. They are ~3x smaller than bf16 and load natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).

Do abliterated models make a difference for LTX-2?

I took a deep dive into this topic and found that the impact is nuanced. Abliteration does alter the embeddings Gemma produces, which slightly changes the generated video. However, there are fundamental limitations:

  • Gemma doesn't know what it wasn't trained on. The base model was never trained on more taboo content. Abliteration removes refusals, but the model simply doesn't have knowledge of things it was never exposed to. Even chatting with the heretic model in llama.cpp, it doesn't refuse — it just doesn't know.
  • LTX-2 was trained on original Gemma embeddings. The DiT expects the embedding distribution from the unmodified text encoder. Fine-tuning the text encoder itself would break the DiT, as it wouldn't know what to do with the new embedding distribution and would produce strange results.
  • Most abliteration happens on layer 48 (the final decision-making layer), but LTX-2 averages across all layers, which may wash out the difference.

A potential approach would be combining a fine-tuned abliterated text encoder with a LoRA trained to understand the new embeddings. LoRAs for LTX exist, but no fine-tuned text encoders have been released yet as far as I know.

That said, abliteration still removes the soft censorship in the embeddings, which can result in more faithful prompt encoding for creative content.

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "DreamFast/gemma-3-12b-it-heretic-v2",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("DreamFast/gemma-3-12b-it-heretic-v2")

prompt = "Write a story about a bank heist"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With ComfyUI (LTX-2)

  1. Download a ComfyUI format file:

    • FP8 (recommended): comfyui/gemma-3-12b-it-heretic-v2_fp8_e4m3fn.safetensors (12GB)
    • NVFP4 (smallest): comfyui/gemma-3-12b-it-heretic-v2_nvfp4.safetensors (7.8GB)
    • bf16 (full precision): comfyui/gemma-3-12b-it-heretic-v2.safetensors (23GB)
  2. Place in ComfyUI/models/text_encoders/

  3. In your LTX-2 workflow, use the LTXAVTextEncoderLoader node and select the heretic file

Tip: For multi-GPU setups or CPU offloading, check out ComfyUI-LTX2-MultiGPU for optimized LTX-2 workflows.

With ComfyUI-GGUF

GGUF support for Gemma 3 text encoders is now merged in ComfyUI-GGUF (PR #402).

  1. Download a GGUF file (Q4_K_M recommended for most setups)
  2. Place in ComfyUI/models/text_encoders/
  3. Use the DualClipLoader (GGUF) node:
    • CLIP 1: the Gemma 3 GGUF file
    • CLIP 2: embedding connectors from Kijai/LTXV2_comfy (use the dev connectors, not distilled)

Note: GGUF text encoders are text-only (no vision). For I2V prompt enhancement with image input, use the safetensors variants.

With llama.cpp

# Using llama-server
llama-server -m gemma-3-12b-it-heretic-v2-Q4_K_M.gguf

# Or with llama-cli
llama-cli -m gemma-3-12b-it-heretic-v2-Q4_K_M.gguf -p "Write a story about a bank heist"

Why Abliterate?

Even when Gemma doesn't outright refuse a prompt, it may "sanitize" or weaken certain concepts in the embeddings. For video generation with LTX-2, this can result in:

  • Weaker adherence to creative prompts
  • Softened or altered visual outputs
  • Less faithful representation of requested content

Abliteration removes this soft censorship, resulting in more faithful prompt encoding.

Abliteration Process

Created using Heretic v1.2.0 with 200 optimization trials:

? Which trial do you want to use?
  [Trial  80] Refusals:  0/100, KL divergence: 0.6098
  [Trial  66] Refusals:  2/100, KL divergence: 0.2087
  [Trial  75] Refusals:  3/100, KL divergence: 0.1378
  [Trial  67] Refusals:  6/100, KL divergence: 0.1108
  [Trial 180] Refusals:  7/100, KL divergence: 0.0996
> [Trial 174] Refusals:  8/100, KL divergence: 0.0801  <-- selected
  [Trial 178] Refusals: 10/100, KL divergence: 0.0801
  [Trial 172] Refusals: 11/100, KL divergence: 0.0708
  ...

Trial 174 was selected for its low KL divergence (0.0801), indicating minimal model damage, while achieving 8/100 refusals (92% of previously-refused prompts now work).

Limitations

  • This model inherits all limitations of the base Gemma 3 12B model
  • Abliteration reduces but does not completely eliminate refusals
  • NVFP4 quantization works best on Blackwell GPUs (RTX 5090/5080) with native FP4 tensor cores, but also works on older GPUs via software dequantization

License

This model is subject to the Gemma license.

Acknowledgments

Author: DreamFast

Likes: 4

Downloads: 0

Tags: transformers, safetensors, gguf, gemma3, image-text-to-text, abliteration, heretic, uncensored, gemma, ltx-2, comfyui, video-generation, text-encoder, nvfp4, blackwell, text-generation, conversational, en, base_model:google/gemma-3-12b-it, base_model:quantized:google/gemma-3-12b-it, license:gemma, text-generation-inference, endpoints_compatible, region:us

eddy1111111/ltx2.3_nvfp4


license: apache-2.0

Base model: https://huggingface.co/Lightricks/LTX-2.3

Author: eddy1111111

Likes: 3

Downloads: 0

Tags: license:apache-2.0, region:us

lylogummy/anima2b-qwen-3.5-4b

Anima 2B with Qwen 3.5 4B

Table of Contents

  1. The Problem
  2. Understanding the Architecture
  3. The Scaling Problem: 4B vs 0.6B
  4. Discovery: The ExpRMSNorm Breakthrough
  5. Procrustes Alignment — Rotating One Brain to Match Another
  6. Per-Dimension Affine Calibration
  7. Recommended Settings for Users
  8. The Mamba2 SSM Rewrite
  9. Tokenizer: Why Qwen3 ≠ Qwen3.5
  10. Timeline & Iteration History

The Problem

Anima 2B ships with a Qwen 3 0.6B text encoder — a small, standard transformer. The model works fine, but 0.6B parameters is a significant bottleneck for understanding complex prompts. nightknocker released a Qwen 3.5 4B hybrid encoder trained for the same ecosystem, promising better text comprehension.

The catch: you can't just swap one text encoder for another. The Anima diffusion model's LLM adapter was trained against the 0.6B's specific embedding distribution. Even though both encoders output 1024-dimensional vectors, they speak completely different "languages" — different magnitude scales, different directions for the same concepts, different statistical distributions.

Our initial naive implementation loaded correctly (all 426/426 weight tensors, 4.14B parameters, no errors), produced valid embeddings with no NaN/Inf... and generated images that were consistently worse than the tiny 0.6B.

This document explains every problem we encountered and how we solved each one.


Understanding the Architecture

Qwen 3.5 4B is not a standard transformer. It's a hybrid model alternating between two fundamentally different sequence processing mechanisms:

| Property | Value |
|---|---|
| Total layers | 32 |
| SSM (Mamba2) layers | 24 (positions 0,1,2, 4,5,6, ..., 28,29,30) |
| Self-Attention layers | 8 (positions 3, 7, 11, 15, 19, 23, 27, 31) |
| Hidden size | 2560 |
| Output dimension | 1024 (after projection) |
| Vocabulary | 248,320 tokens |
| Weight format | FP8 (F8_E4M3) with BF16 norms |

The pattern is simple: every 4th layer is self-attention, the other three are SSM blocks. The final layer (31) is attention-only with no MLP. This hybrid design gives the model the long-range memory of state space models with periodic full-attention "checkpoints" for global context.

Output pipeline: The raw 2560-dim hidden states go through a learned projection:

Linear(2560 → 1024) → ExpRMSNorm(1024) → SiLU → Linear(1024 → 1024)

This maps the model's internal representation into the 1024-dim space that the Anima adapter expects.


The Scaling Problem: 4B vs 0.6B

Here's what the raw output distributions look like side by side:

| Metric | 0.6B (Original) | 4B (Raw) | Ratio |
|--------|:---:|:---:|:---:|
| Global mean | -0.068 | 0.0015 | ~45× difference |
| Global std | 3.36 | 0.33 | ~10× smaller |
| L2 norm / token | 106.6 | 10.5 | ~10× smaller |

The 4B encoder's outputs are roughly 10× smaller in magnitude than what the Anima adapter expects. Imagine whispering instructions to someone who's used to being shouted at — the signal is there, but it's far too quiet to drive the diffusion process effectively.

This isn't a bug — it's a consequence of two models with different architectures, different training procedures, and different internal normalizations producing embeddings at different scales. The 0.6B was the encoder that the adapter was trained against, so its scale IS the expected scale.


Discovery: The ExpRMSNorm Breakthrough

Before we could even think about alignment, we had to fix a fundamental error in how we interpreted the model's normalization layer.

The Mystery of the Near-Zero Weights

All 64 internal RMSNorm layers in the model have learned weights with sensible values — centered between 0.04 and 1.11. These are normal scaling factors: the model learns to emphasize some dimensions and suppress others.

But the late normalization layer (the one in the output projection) had weights centered around -0.003. Nearly zero.

With standard RMSNorm, those weights multiply the normalized output directly:

output = weight * (x / RMS(x))

If weight ≈ -0.003, you're scaling everything down to essentially nothing. And that's exactly what happened:

| Metric | Standard RMSNorm (broken) | ExpRMSNorm (fixed) |
|--------|:---:|:---:|
| Output std | 0.018 | 0.324 (18× larger) |
| L2 / token | 0.58 | 10.37 (18× larger) |
| Token diversity | 0.003 | 0.821 (274× larger!) |
| Cross-prompt similarity | 0.999 (everything identical) | 0.689 (distinguishable) |

Token diversity of 0.003 means every single token in every single prompt was being mapped to essentially the same vector. The model's understanding was being completely destroyed at the output gate.

The Fix: exp(weight) Parameterization

The late norm uses exponential weight parameterization:

output = exp(weight) * (x / RMS(x))

With weight ≈ -0.003:

  • Standard: scale = -0.003 → collapses everything
  • Exponential: scale = exp(-0.003) ≈ 0.997 → near-identity, with tiny learned perturbations

This is the difference between "scale to zero" and "scale to approximately one with fine-grained adjustments." The late norm's weights are near zero precisely because it uses this parameterization: exp(0) = 1 is the neutral point.

This single fix took token diversity from 0.003 to 0.821 — from "completely collapsed" to "rich, distinguishable representations."
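
A minimal PyTorch sketch of the exp-parameterized norm and the output head it sits in, as we understand them (a reconstruction for illustration, not the upstream code):

import torch
import torch.nn as nn

class ExpRMSNorm(nn.Module):
    # RMSNorm whose learned scale is exp(weight), so weight near 0 means scale near 1
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))   # near-zero weights => near-identity scale
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return torch.exp(self.weight) * (x * rms)      # exp(w), not w, multiplies the output

# Output pipeline described earlier: Linear(2560 -> 1024) -> ExpRMSNorm -> SiLU -> Linear(1024 -> 1024)
head = nn.Sequential(nn.Linear(2560, 1024), ExpRMSNorm(1024), nn.SiLU(), nn.Linear(1024, 1024))
print(head(torch.randn(1, 8, 2560)).shape)             # torch.Size([1, 8, 1024])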


Procrustes Alignment — Rotating One Brain to Match Another

Even after fixing the ExpRMSNorm, the 4B generates images that don't follow the prompt well. Why? Because the 4B and 0.6B encode the same concepts in different directions.

Think of it this way: both models understand what "from the side" means, but the 0.6B might encode that as a vector pointing northeast in embedding space, while the 4B encodes it as a vector pointing southwest. The adapter was trained to interpret northeast as a side view — so when it sees southwest, it does something completely wrong.

What Is Procrustes Alignment?

Procrustes alignment finds the optimal rotation matrix R that maps one embedding space onto another:

$$R^* = \arg\min_{R} \| R \cdot X_{4B} - X_{0.6B} \|_F \quad \text{subject to} \quad R^T R = I$$

The constraint $R^T R = I$ means R is orthogonal — it's a pure rotation/reflection. No stretching, no squishing. Every distance between embeddings in the 4B's space is perfectly preserved. We're just reorienting the compass.

How We Computed It

We ran 41,277 prompts through both encoders and collected their mean-pooled 1024-dim embeddings. Then we applied Orthogonal Procrustes (via SVD of the cross-covariance matrix) to find the best rotation.
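
A minimal sketch of that fit using SciPy's orthogonal Procrustes solver (row-vector convention, so the solved matrix is the transpose of the R in the equation above; the random arrays stand in for the real 41,277 prompt embeddings):

import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X_4b = rng.standard_normal((4096, 1024)).astype(np.float32)    # mean-pooled 4B embeddings
X_06b = rng.standard_normal((4096, 1024)).astype(np.float32)   # mean-pooled 0.6B embeddings

mean_4b, mean_06b = X_4b.mean(axis=0), X_06b.mean(axis=0)

# Solve min_R ||(X_4b - mean_4b) @ R - (X_06b - mean_06b)||_F  s.t.  R^T R = I
R, _ = orthogonal_procrustes(X_4b - mean_4b, X_06b - mean_06b)
print(np.allclose(R.T @ R, np.eye(1024), atol=1e-3))           # True: pure rotation/reflection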

The results:

| | Before Alignment | After Alignment |
|---|:---:|:---:|
| Mean cosine similarity | -0.034 | 0.960 |
| Minimum cosine similarity | -0.115 | 0.766 |

Before alignment, the two encoders had negative average cosine similarity — their concept directions were essentially uncorrelated. After: 0.96 average agreement.

Per-category breakdown:

| Category | Before → After |
|---|---|
| Spatial (viewpoints, poses) | -0.034 → 0.960 |
| Pose | -0.021 → 0.943 |
| Composition | -0.028 → 0.956 |
| Character | -0.028 → 0.943 |
| Environment | -0.025 → 0.954 |
| Meta (quality tags) | -0.034 → 0.838 |
| Multi (complex prompts) | -0.027 → 0.898 |

Rotation vs. Bias Shift

The alignment has two components:

  1. Rotation — The 1024×1024 orthogonal matrix R that reorients concept directions. This is always applied when alignment is enabled. It fixes what direction concepts point in, without changing magnitude.

  2. Bias shift — Re-centering from the 4B's mean embedding to the 0.6B's mean embedding. The 0.6B's mean has L2≈70 while the 4B's mean has L2≈5, so the full shift dramatically changes output magnitude. This is controlled by the alignment_strength slider.

The alignment_strength parameter (0.0–1.0) only controls the bias shift, not the rotation:

x_aligned = R @ (x - mean_4b) + (1 - α) * mean_4b + α * mean_06b

  • α = 0.0: Rotate only, keep 4B's own magnitude
  • α = 0.5: Rotate + halfway bias shift (recommended starting point)
  • α = 1.0: Rotate + full shift to 0.6B's distribution center
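
In NumPy terms, reusing R and the two means from the fit above (a sketch of the same formula in row-vector convention; the names are ours):

import numpy as np

def align(x_4b: np.ndarray, R: np.ndarray, mean_4b: np.ndarray,
          mean_06b: np.ndarray, alignment_strength: float = 0.5) -> np.ndarray:
    # Rotation is always applied; alignment_strength only blends the bias shift.
    a = alignment_strength
    rotated = (x_4b - mean_4b) @ R
    return rotated + (1.0 - a) * mean_4b + a * mean_06b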

Per-Dimension Affine Calibration

Beyond rotation, the two encoders also differ in their per-dimension scales. Dimension 42 in the 0.6B might have 3× the variance of dimension 42 in the 4B, while dimension 500 might be 0.5×.

The calibration computes a per-dimension affine transform:

output_calibrated[d] = scale[d] * output_4b[d] + bias[d]

Where:

scale[d] = std_06b[d] / std_4b[d]
bias[d]  = mean_06b[d] - scale[d] * mean_4b[d]
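
In NumPy form, assuming X_06b and X_4b hold per-token embeddings collected from the calibration prompts (a sketch with our own names):

import numpy as np

def fit_calibration(X_06b: np.ndarray, X_4b: np.ndarray):
    # Per-dimension affine transform mapping 4B statistics onto 0.6B statistics
    scale = X_06b.std(axis=0) / (X_4b.std(axis=0) + 1e-8)
    bias = X_06b.mean(axis=0) - scale * X_4b.mean(axis=0)
    return scale, bias

def apply_calibration(x_4b: np.ndarray, scale: np.ndarray, bias: np.ndarray) -> np.ndarray:
    return scale * x_4b + bias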

Calibration statistics (from 30 diverse prompts):

| | Value |
|---|---|
| Scale range | 1.03 – 79.7 |
| Scale mean | 5.47 |
| Bias mean | -0.075 |

Most dimensions need ~5× scaling. Some need up to 80×. This makes sense given the 10× overall magnitude difference — it's not uniform across dimensions.

Note: Calibration and alignment serve different purposes. Alignment fixes directions (rotation). Calibration fixes magnitudes (per-dimension scaling). They can be used independently or together.


Recommended Settings for Users

Start Simple, Add Complexity

Step 1: Baseline (no alignment, no calibration)

use_alignment: OFF
use_calibration: OFF
output_scale: 1.0

Generate some images with your usual prompts. This gives you the raw 4B output — better text understanding, but the adapter may misinterpret concept directions.

Step 2: Add alignment at half strength

use_alignment: ON
alignment_strength: 0.5
use_calibration: OFF
output_scale: 1.0

This rotates the 4B's concept space to match the 0.6B (fixing spatial/pose understanding) while blending the magnitude halfway between the two encoders. Compare your results — you should see better prompt adherence, especially for viewpoints, poses, and spatial composition.

Step 3: Experiment with strength

  • If poses/viewpoints still aren't quite right, increase alignment_strength toward 1.0
  • If the image quality or detail seems to degrade at high strength, back off toward 0.3
  • The sweet spot varies by prompt type — 0.5 is a good general default

Step 4 (Optional): Try calibration

use_calibration: ON

This applies per-dimension scaling on top of alignment. It can help in some cases but may also over-correct. Test both ways and compare.

Quick Reference

| Setting | What It Does | When to Use |
|---|---|---|
| alignment OFF | Raw 4B embeddings | Baseline comparison |
| alignment ON, strength 0.0 | Rotation only, 4B magnitude | Fix concept directions without changing scale |
| alignment ON, strength 0.5 | Rotation + half bias shift | Best general starting point |
| alignment ON, strength 1.0 | Full 0.6B-like distribution | Maximum compatibility with adapter |
| calibration ON | Per-dimension affine scaling | Fine-grained magnitude matching |
| output_scale | Uniform multiplier | Last-resort manual adjustment |


The Mamba2 SSM Rewrite

Qwen 3.5 4B is not a standard transformer you can load with a config swap — 24 of its 32 layers are Mamba2 selective state space blocks, an architecture with no off-the-shelf ComfyUI support. We had to implement the full SSM from scratch.

The approach was to work directly from the reference Mamba2 implementation, mapping every projection, convolution, and recurrence step to the weight shapes we found in the checkpoint. The initial implementation ran without errors but produced garbage embeddings — every tensor shape was valid, no NaN/Inf, just wrong math.

The rewrite came down to carefully matching the reference's data flow: which projections go through the causal conv1d and which bypass it as a gate, the full multi-dimensional state recurrence (not a scalar approximation), input-dependent discretization that makes the SSM selective, and the skip connections that the architecture relies on. Several hundred million parameters that were being loaded but never actually used in the forward pass are now contributing.

The key insight was that SSM bugs are silent — the shapes all work out, gradients would flow if you were training, and the output looks like plausible floating point numbers. The only way to catch them was methodical comparison against the reference code, projection by projection.


Tokenizer: Why Qwen3 ≠ Qwen3.5

This was an easy mistake to make — and a critical one to fix.

| | Qwen 3 (0.6B) | Qwen 3.5 (4B) |
|---|:---:|:---:|
| Vocabulary size | 151,936 | 248,320 |
| Extra tokens | — | +96,384 (3 blocks of 32,128) |

The extra 96,384 tokens in Qwen 3.5 correspond exactly to T5's vocabulary size (32,128 × 3), suggesting the model was designed to bridge between Qwen and T5 embedding spaces.

Using the Qwen 3 tokenizer with the 4B model means:

  • Different BPE merge rules produce different token boundaries
  • Every token ID potentially maps to the wrong embedding row
  • 96,384 trained embedding rows are never accessed
  • The model receives garbled input it was never trained on

The node bundles the correct Qwen 3.5 tokenizer (248,320 tokens) and falls back to auto-downloading from Qwen/Qwen3.5-4B on HuggingFace if local files aren't found.
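
A quick way to see the mismatch is to compare the two vocabularies directly. The repository IDs below are assumptions for illustration; the node ships its own tokenizer files:

from transformers import AutoTokenizer

# Hypothetical check of the two vocabularies; repo IDs are assumptions.
tok_06b = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
tok_4b = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")

print(len(tok_06b), len(tok_4b))     # 151936 vs 248320, per the table above

prompt = "a red fox leaping over a frozen river at dawn"
print(tok_06b(prompt)["input_ids"])
print(tok_4b(prompt)["input_ids"])   # different BPE merges can give different boundaries and IDs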


Timeline & Iteration History

v0.1.0 — Initial Release (2026-03-08)

  • Full custom implementation of the hybrid Mamba2/Attention architecture
  • Weight loading (426 tensors, 4.14B parameters)
  • ComfyUI CLIP-compatible
  • Result: Images generated but consistently worse than 0.6B

v0.2.0 — Mamba2 SSM Rewrite (2026-03-09)

  • Fixed 5 critical bugs in the SSM block (conv split, gate, d_state, dt, D skip)
  • ~240M previously-ignored parameters now contributing
  • Result: Better internal representations, still misaligned with adapter

v0.3.0 — ExpRMSNorm Discovery (2026-03-09)

  • Discovered the late norm uses exp(weight) parameterization
  • Token diversity went from 0.003 to 0.821 (274× improvement)
  • Result: Meaningful, distinguishable embeddings for the first time

v0.4.0 — Alignment & Calibration (2026-03-09)

  • Procrustes alignment over 41K prompts (cosine similarity: -0.03 → 0.96)
  • Per-dimension affine calibration from 30 diverse prompts
  • Correct Qwen 3.5 tokenizer (vocab=248,320)
  • Result: Substantially improved prompt adherence and image quality

This node is open source. Contributions, testing results, and alignment experiments are welcome.

Author: lylogummy

Likes: 3

Downloads: 0

Tags: region:us

temsa/govie-office-holder-reranker-v1


license: apache-2.0
language:
  - en
library_name: sentence-transformers
pipeline_tag: text-ranking
tags:
  - sentence-transformers
  - cross-encoder
  - reranker
  - ireland
  - govie
base_model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
model-index:
  - name: govie-office-holder-reranker-v1
    results:
      - task:
          type: text-ranking
          name: Text Ranking
        dataset:
          name: Office-holder canary eval
          type: internal
        metrics:
          - type: top1_accuracy
            value: 1.0
            name: Top-1 Accuracy
          - type: mrr
            value: 1.0
            name: MRR

govie-office-holder-reranker-v1

This is a cross-encoder reranker fine-tuned from cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 for office-holder and current-role queries over public gov.ie content.

What changed

The base reranker was broadly better than the previous production reranker, but it regressed on a critical slice of office-holder queries such as:

  • Who is the Taoiseach?
  • Who is the current Taoiseach?
  • Who is the Tánaiste?
  • Who is Michael Martin?

This model is a narrow fine-tune intended to recover that query family without changing the broader runtime architecture.

Training snapshot

  • Base model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
  • Training rows: 166
  • Eval rows: 122
  • Snapshot date: 2026-03-10
  • Max length: 128
  • Epochs: 2
  • Batch size: 16
  • Learning rate: 2e-5
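
A minimal sketch of how a fine-tune with these hyperparameters could be reproduced with the classic sentence-transformers CrossEncoder.fit API. The example pairs and label scheme below are assumptions, not the internal training rows:

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Sketch only: illustrative pairs, not the internal dataset.
model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1", max_length=128)

train_samples = [
    InputExample(texts=["Who is the Taoiseach?",
                        "Micheál Martin is the Taoiseach."], label=1.0),
    InputExample(texts=["Who is the Taoiseach?", "Former Taoisigh"], label=0.0),
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(
    train_dataloader=train_dataloader,
    epochs=2,
    optimizer_params={"lr": 2e-5},
)
model.save("govie-office-holder-reranker-v1-local")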

Benchmark

Held-out office-holder eval set, frozen on 2026-03-10:

| Model | Top-1 | Hit@5 | MRR |
| --- | ---: | ---: | ---: |
| previous production reranker | 0.8889 | 1.0000 | 0.9306 |
| current base reranker | 0.7778 | 1.0000 | 0.8657 |
| this model | 1.0000 | 1.0000 | 1.0000 |

This evaluation slice is intentionally narrow. It is useful as a canary for office-holder/current-role behavior, not as a claim of global search quality.

Training data

The fine-tune used an internal derived dataset built from public gov.ie pages and public query templates generated from current minister and role names.

The raw internal dataset is not published as-is because it currently contains:

  • weak-supervision artifacts from live SERP captures
  • public contact details embedded in page text
  • incidental third-party contact details in some negative examples

A sanitized public dataset artifact or generator should be used instead for open release.

Intended use

  • reranking public-sector search candidates for office-holder and current-role queries
  • evaluation and ablation work on temporal/current-entity ranking behavior

Limitations

  • This is a small, domain-specific fine-tune.
  • The benchmark is narrow and snapshot-bound.
  • Live failures can still come from candidate generation or rerank promotion windows upstream of the model.
  • The snapshot reflects office holders as of 2026-03-10 and will age over time.

Usage

from sentence_transformers import CrossEncoder

model = CrossEncoder("temsa/govie-office-holder-reranker-v1")
scores = model.predict(
    [
        [
            "Who is the Taoiseach?",
            "Micheál Martin is the Taoiseach. He was appointed to this role on 23 January 2025.",
        ],
        [
            "Who is the Taoiseach?",
            "Former Taoisigh",
        ],
    ]
)
print(scores)

Attribution

Source material was derived from public gov.ie pages. Reusers should preserve attribution to the Government of Ireland and not imply any official endorsement.

Author: temsa

Likes: 2

Downloads: 0

Tags: sentence-transformers, safetensors, xlm-roberta, cross-encoder, reranker, ireland, govie, text-ranking, en, base_model:cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, base_model:finetune:cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, license:apache-2.0, model-index, text-embeddings-inference, endpoints_compatible, region:us

Tralalabs/PicoLM-0.5M


language:
  - en
license: apache-2.0
tags:
  - gpt2
  - language-model
  - pretraining
  - causal-lm
  - tiny-model
  - from-scratch
datasets:
  - roneneldan/TinyStories
  - sedthh/gutenberg_english
  - Salesforce/wikitext
  - nyu-mll/glue
base_model: none
pipeline_tag: text-generation

PicoLM-0.5M

A ~0.48M parameter GPT-2 style causal language model pretrained from scratch using a custom 4096-vocab BPE tokenizer. The smallest model in the PicoLM family. Trained in ~25 minutes on a single NVIDIA T4 GPU.

Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | ~0.48M |
| Context length | 256 tokens |
| Vocabulary size | 4,096 (custom BPE) |
| Layers | 4 |
| Attention heads | 4 |
| Hidden size | 64 |
| FFN size | 256 |
| Tokenizer | Custom BPE trained on TinyStories |
| Training steps | 5,000 |

Training

Hardware: Google Colab, NVIDIA T4 (15GB VRAM)

Dataset mix:

  • 50% TinyStories — synthetic English children's stories
  • 25% Gutenberg English — public domain classic literature
  • 15% WikiText-2 — English Wikipedia slice
  • 10% CoLA — grammatically labeled English sentences

Training config:

  • Optimizer: AdamW (lr=5e-4, weight_decay=0.1)
  • LR schedule: Cosine with 300 warmup steps
  • Batch size: 32 x 2 grad accum = effective batch 64
  • Mixed precision: fp16
  • Streaming: yes (no full dataset download)
  • Custom BPE tokenizer trained on 50k TinyStories samples
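
A hypothetical sketch of this configuration using transformers' Trainer; dataset streaming and tokenization are omitted, and the names here are assumptions rather than the actual training script:

from transformers import GPT2Config, GPT2LMHeadModel, TrainingArguments, Trainer

# Sketch of the configuration above; not the actual training script.
config = GPT2Config(
    vocab_size=4096, n_positions=256, n_embd=64,
    n_layer=4, n_head=4, n_inner=256,
)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir="picolm-0.5m",
    max_steps=5_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,     # effective batch size 64
    learning_rate=5e-4,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_steps=300,
    fp16=True,
)
# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()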

Usage

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast, GenerationConfig
import torch

tokenizer = PreTrainedTokenizerFast.from_pretrained("Tralalabs/PicoLM-0.5M")
model = GPT2LMHeadModel.from_pretrained("Tralalabs/PicoLM-0.5M")

inputs = tokenizer("Once upon a time", return_tensors="pt")
gen_config = GenerationConfig(
    max_new_tokens=60,
    do_sample=True,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
out = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Sample Outputs

Prompt: Once upon a time

Once upon a time the Itfore, or more as he must not of the place. As he could ween her, "I said, and all his head...

Prompt: In the beginning

In the beginning of the "I What did I think you, so the devan the I would have done...

Parameter Breakdown

| Component | Params |
|---|---|
| Token embedding (4096 x 64) | 262,144 |
| Position embedding (256 x 64) | 16,384 |
| 4 transformer layers | 198,656 |
| Final LayerNorm | 128 |
| LM head (tied to embedding) | 0 |
| Total | ~477,312 |

Languages

English only. All training datasets are English. The model has no meaningful multilingual capability.

Knowledge

| Property | Value |
|---|---|
| Training data cutoff | 2023 (TinyStories generation date) |
| Knowledge cutoff | ~2016 (WikiText-2 Wikipedia snapshot) |
| Real-world knowledge | Effectively none |
| Oldest data | Pre-1928 (Project Gutenberg) |

Limitations

  • Extremely small scale — outputs are often incoherent or repetitive
  • Custom 4,096-token BPE vocabulary is not interchangeable with the tokenizers of other pretrained models
  • Gutenberg OCR artifacts may appear in outputs (stray numbers, broken unicode)
  • Not instruction-tuned — base pretrained model only
  • No real-world knowledge retention at this scale
  • Best treated as a research/educational artifact

Intended Use

  • Educational — understanding minimum viable language model pretraining
  • Baseline for ultra-small model research
  • Experimentation with tiny model fine-tuning
  • Part of the PicoLM model family

PicoLM Family

| Model | Params | Status |
|---|---|---|
| PicoLM-0.5M | 0.48M | Released |
| PicoLM-15M | 19M | Released |
| PicoLM-15M-IT | 19M | Released |
| PicoLM-60M | 60M | Planned |

Citation

@misc{picolm2026,
  author = {Tralalabs},
  title = {PicoLM-0.5M: A Minimum Viable Language Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Tralalabs/PicoLM-0.5M}
}

Author: Tralalabs

Likes: 2

Downloads: 0

Tags: safetensors, gpt2, language-model, pretraining, causal-lm, tiny-model, from-scratch, text-generation, en, dataset:roneneldan/TinyStories, dataset:sedthh/gutenberg_english, dataset:Salesforce/wikitext, dataset:nyu-mll/glue, license:apache-2.0, region:us

lukealonso/Qwen3.5-397B-A17B-NVFP4


base_model:
  - Qwen/Qwen3.5-397B-A17B
license: apache-2.0
tags:
  - qwen3.5
  - moe
  - quantized
  - nvfp4
  - fp4
  - multimodal
  - vision
pipeline_tag: text-generation

Model Description

Qwen3.5-397B-A17B-NVFP4 is an NVFP4-quantized version of Qwen/Qwen3.5-397B-A17B, a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).

The original BF16 weights were quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer v0.37.0.
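
As a rough illustration of the format (not Model Optimizer's actual kernels, and with per-block scales kept in full precision rather than FP8 for simplicity), fake-quantizing a weight tensor to the FP4 (E2M1) value grid with one scale per 16 elements looks roughly like this:

import torch

# Illustrative fake-quantization to the FP4 (E2M1) grid with one scale per
# 16-element block; real NVFP4 additionally stores the per-block scales in FP8.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(w, block=16):
    pad = (-w.numel()) % block
    x = torch.nn.functional.pad(w.flatten(), (0, pad)).view(-1, block)
    scale = (x.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-12)   # per-block scale
    idx = ((x / scale).abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * x.sign() * scale                # snap to grid, restore sign and scale
    return q.view(-1)[: w.numel()].view_as(w)

w = torch.randn(4096, 16)
print((w - nvfp4_fake_quant(w)).abs().mean())           # small quantization error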

What's quantized

Only the routed MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:

  • Shared expert MLPs (these process every token, unlike routed experts; thanks to Festr for pointing this out)
  • Attention layers (softmax attention and DeltaNet linear attention)
  • Vision encoder (ViT)
  • Router / gate weights
  • MTP (multi-token prediction) draft model
  • Embeddings, layer norms, lm_head

Since the expert weights constitute the vast majority of the 397B parameters, this still yields significant memory savings: the quantized checkpoint is ~233 GB on disk.

Calibration methodology

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate for the sparser per-expert coverage this produces, calibration was run on a much larger number of samples than is typical (~16k+ samples across multiple datasets) to ensure broad expert coverage through natural routing alone.

Calibration datasets:

  • Agentic coding — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
  • Multimodal VQA — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
  • Deep reasoning — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
  • Diverse instruction following — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)

All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 out of ~184k routed expert quantizers (~0.14%) had their amaxes floored to median/10 of their peer group to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.

Compared to nvidia/Qwen3.5-397B-A17B-NVFP4 (calibrated on cnn_dailymail + Nemotron), the input activation scales in this checkpoint show significantly higher entropy per layer (avg 3.92 vs 2.35 bits across all 60 layers), indicating more differentiated per-expert scales. This suggests the diverse calibration mix produced scales that better reflect each expert's actual activation range, rather than collapsing toward similar values from homogeneous calibration data.
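
One way such a per-layer entropy comparison could be computed is sketched below; the bin count and the exact definition are assumptions, since the checkpoint card does not spell out the methodology:

import torch

# Sketch: histogram a layer's per-expert input-activation scales and measure the
# Shannon entropy of that histogram in bits. Collapsed (near-identical) scales land
# in few bins and give low entropy; differentiated scales spread out and give more.
def scale_entropy_bits(scales: torch.Tensor, bins: int = 32) -> float:
    hist = torch.histc(scales, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log2()).sum())

layer_scales = torch.rand(512)             # stand-in for 512 per-expert scales in one layer
print(scale_entropy_bits(layer_scales))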

Quality

Initial testing has been positive, but you should evaluate against your specific use case.

How to Run

SGLang

Tested on 4x RTX Pro 6000 Blackwell with PCIe.

python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000

Notes:

  • The MTP draft model is included for speculative decoding (--speculative-algo NEXTN).
  • If you experience NCCL hangs with P2P, make sure you have iommu=pt (and amd_iommu=pt on AMD platforms) in your kernel command line.
  • Vision/multimodal inference is supported — the full ViT encoder is included unquantized.

License

Apache 2.0, following the base model.

Author: lukealonso

Likes: 2

Downloads: 0

Tags: safetensors, qwen3_5_moe, qwen3.5, moe, quantized, nvfp4, fp4, multimodal, vision, text-generation, conversational, base_model:Qwen/Qwen3.5-397B-A17B, base_model:quantized:Qwen/Qwen3.5-397B-A17B, license:apache-2.0, modelopt, region:us