Today's AI Summary

AI Developments: NVIDIA's Reasoning Models, Generative Assembly, and More

This week's AI landscape features advancements in reasoning models, generative assembly, and techniques for enhancing long video generation. Here's a breakdown of the key developments:

Research Highlights

  • Prompt-to-Product: Generative Assembly via Bimanual Manipulation (arXiv:2508.21063): This paper introduces an automated pipeline that generates real-world assembly products from natural language prompts, using LEGO bricks as the assembly platform. The system designs physically buildable brick structures and uses a bimanual robotic system to construct them.
  • OnGoal: Tracking and Visualizing Conversational Goals (arXiv:2508.21061): This research presents OnGoal, an LLM chat interface that provides real-time feedback on goal alignment, explanations for evaluation results, and overviews of goal progression. A user study showed that OnGoal helps users achieve their goals more efficiently.
  • Mixture of Contexts for Long Video Generation (arXiv:2508.21058): This paper introduces Mixture of Contexts (MoC), a sparse attention routing module for long-context video generation. MoC enables models to retain and retrieve salient events across long ranges by dynamically selecting informative chunks to attend to, improving memory and consistency.
  • FakeParts: a New Family of AI-Generated DeepFakes (arXiv:2508.21052): This paper introduces FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. To address the critical gap in detection capabilities, the authors present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes.
  • Enabling Equitable Access to Trustworthy Financial Reasoning (arXiv:2508.21051): This paper proposes an approach that integrates LLMs with a symbolic solver to calculate tax obligations. The results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.
  • Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning (arXiv:2508.21048): This paper introduces Veritas, a multi-modal large language model (MLLM) based deepfake detector. Unlike vanilla chain-of-thought (CoT), the authors introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate the human forensic process.
  • Understanding, Protecting, and Augmenting Human Cognition with Generative AI (arXiv:2508.21036): This paper synthesizes the material from the CHI 2025 workshop on Tools for Thought to begin mapping the space of research and design opportunities and to catalyze a multidisciplinary community around this pressing area of research.
  • Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance (arXiv:2508.21016): This work introduces Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG) by combining the outputs of the base and RL fine-tuned models via a geometric average.
  • ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering (arXiv:2508.21010): This paper proposes a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations.
  • Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees (arXiv:2508.21001): This paper presents Diffusion Tree (DiTree): a provably generalizable framework leveraging diffusion policies (DPs) as informed samplers to efficiently guide state-space search within sampling-based planners (SBPs).

Model Releases

  • QuantFactory/NVIDIA-Nemotron-Nano-12B-v2-GGUF & QuantFactory/NVIDIA-Nemotron-Nano-9B-v2-GGUF: Quantized versions of NVIDIA's Nemotron-Nano models, designed for reasoning and chat applications. The 12B model achieves strong benchmark results, including 76.25% on AIME25 and 97.75% on MATH500. These models support runtime "thinking" budget control, allowing users to specify how many tokens the model can use for reasoning.
  • QuantFactory/Datarus-R1-14B-preview-G

AI Papers for 2026-04-06

ActionParty: Multi-Subject Action Binding in Generative Video Games

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

Steerable Visual Representations

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that *token initialization* is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the *Grounded Token Initialization Hypothesis*: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement (BCR), a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results show that BCR is practical: simple structural incentives unlock latent high-density reasoning in LLMs.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals interaction awareness is latent, with follow-up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.

VOID: Video Object and Interaction Deletion

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with a 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond the 0.583 baseline level, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Agentic AI shifts the investor's role from analytical execution to oversight. We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output. A researcher agent proposes new portfolio construction methods not yet represented, and a meta-agent compares past forecasts against realized returns and rewrites agent code and prompts to improve future performance. The entire pipeline is governed by the Investment Policy Statement: the same document that guides human portfolio managers can now constrain and direct autonomous agents.

AI Models

0xSero/gemma-4-21b-a4b-it-REAP


language:

  • en
license: gemma
tags:
  • safetensors
  • gemma4
  • moe
  • pruning
  • reap
  • cerebras
  • expert-pruning
base_model:
  • google/gemma-4-26b-a4b-it
library_name: transformers
pipeline_tag: text-generation

Gemma 4 21B-A4B-it REAP

20% expert-pruned version of google/gemma-4-26b-a4b-it using Cerebras REAP (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | 0.30 variant |
|---|---:|---:|---:|
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields an ~18% reduction in total disk/memory footprint.

How This Model Was Made

Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns. The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.

Calibration dataset: 22,000 samples drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
|----------|--------:|----------------|
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA[pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k[main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| Total | 22,000 | |

Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
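The pruning step above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the exact REAP saliency formula is in the paper, and `reap_saliency` / `prune_experts` are hypothetical helpers.

```python
import numpy as np

def reap_saliency(gates, norms):
    """Illustrative router-weighted saliency (not the paper's exact formula):
    average gate value times expert activation norm over calibration tokens,
    weighted by how often the router actually selected each expert."""
    freq = (gates > 0).mean(axis=0)            # selection frequency per expert
    return (gates * norms).mean(axis=0) * freq

def prune_experts(saliency, compression=0.20):
    """Remove the lowest-scoring fraction of experts; return kept indices."""
    n_remove = int(len(saliency) * compression)      # 128 * 0.20 -> 25 removed
    return np.sort(np.argsort(saliency)[n_remove:])  # keep the top 103

# Synthetic calibration statistics: 1000 tokens, 128 experts, sparse routing
rng = np.random.default_rng(42)
gates = rng.random((1000, 128)) * (rng.random((1000, 128)) > 0.9)
norms = rng.random((1000, 128)) + 0.5
kept = prune_experts(reap_saliency(gates, norms))
print(kept.size)  # 103
```

After pruning, the router logits over the surviving experts are renormalized so the output distribution is preserved, as described above.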

Pruning Configuration

| Parameter | Value |
|-----------|-------|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

Benchmark Results

Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with --apply_chat_template and think_end_token=<channel|> to properly handle Gemma 4's thinking mode. Scores extracted from model responses using regex matching.

| Task | Original | REAP 0.20 | REAP 0.30 |
|------|-------:|-------:|-------:|
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | 86% | 84% | -- |

* Tasks with significant extraction failures (model outputs equations rather than single letters). Real accuracy likely higher for all models.

Notes:

  • Gemma 4 is a thinking model -- it reasons internally before answering. Standard log-likelihood-based benchmarks give incorrect results because the model emits reasoning tokens before its final answer.
  • GSM8K uses flexible-extract which handles thinking output well.
  • College CS and math tasks show REAP sometimes outperforming the original, likely due to sampling variance at n=50.
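The regex extraction discussed above, and its failure mode on equation-style outputs, can be illustrated with a toy filter. This is a hypothetical sketch, not lm-eval's actual filter chain; `extract_choice` is an invented helper.

```python
import re

def extract_choice(response: str):
    """Pull a final multiple-choice letter (A-D) out of a free-form,
    possibly thinking-style response. Prefers an explicit 'answer is X'
    pattern, then falls back to a bare letter on its own."""
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-Da-d])\)?", response)
    if m:
        return m.group(1).upper()
    m = re.fullmatch(r"\(?([A-Da-d])\)?\.?", response.strip())
    return m.group(1).upper() if m else None

print(extract_choice("After working through the options, the answer is (C)."))  # C
print(extract_choice("x = 2y + 1"))  # None: equation output, extraction fails
```

The second call shows why math tasks under-report accuracy: when the model answers with an equation instead of a letter, the filter finds nothing to score.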

Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress with proper chat template formatting.

| Domain | N | Orig AvgWords | REAP AvgWords | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
|--------|--:|-------------:|--------------:|----------:|----------:|--------------:|--------------:|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms). The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

  • 30 transformer layers
  • Sliding attention (window=1024) for 25 layers, full attention every 6th layer
  • MoE FFN with 103 remaining experts per layer (originally 128), 8 active per token
  • Thinking model -- uses <|channel>thought / <|channel>response channels
  • Multimodal -- supports text and vision inputs
  • Context window: 262,144 tokens
  • Vocab size: 262,144
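Assuming a regular interleaving, the sliding/full split listed above works out as follows; the exact positions of the full-attention layers are an assumption, but the counts match the card.

```python
def attention_pattern(n_layers=30, full_every=6):
    """Sketch of the hybrid layout: every 6th layer uses full attention,
    the rest use a 1024-token sliding window."""
    return ["full" if (i + 1) % full_every == 0 else "sliding"
            for i in range(n_layers)]

pattern = attention_pattern()
print(pattern.count("sliding"), pattern.count("full"))  # 25 5
```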

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM

pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code

Citation

@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}

Links

Author: 0xSero

Likes: 34

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, moe, pruning, reap, cerebras, expert-pruning, text-generation, conversational, en, arxiv:2510.13999, license:gemma, endpoints_compatible, region:us

caiovicentino1/FLUX.2-klein-9B-PolarQuant-Q5


license: other
license_name: flux-non-commercial
tags:
  • polarquant
  • flux
  • text-to-image
  • diffusion
  • image-generation
  • bit-packed
base_model: black-forest-labs/FLUX.2-klein-9B
pipeline_tag: text-to-image
arxiv: "2603.29078"

🧊 FLUX.2-klein-9B — PolarQuant Q5

First PolarQuant quantized FLUX.2 — 9B rectified flow transformer for text-to-image & image-to-image.

📊 Compression Results

| | BF16 (Original) | FP8 (Official) | PolarQuant Q5 |
|---|---|---|---|
| Download | 18 GB | 9 GB | 18.8 GB |
| cos_sim | — | — | 0.9986 |
| Quality | Baseline | Good | Near-lossless |
| Layers quantized | — | — | 121 |
| Layers preserved | — | — | 304 (norms, scales) |
| Bit packing | — | — | 5-bit packed (1.6x vs int8) |

Quality: ████████████████████████████████████████ 0.9986 cos_sim
         ↑ Practically identical to original

šŸ—ļø Architecture

FLUX.2-klein-9B (Rectified Flow Transformer)
├── double_blocks: img_attn + txt_attn + img_mlp + txt_mlp
├── single_blocks: attn + mlp
├── 9.1B parameters total
├── 121 layers → PQ5 (Hadamard + Lloyd-Max 32 centroids)
└── 304 layers → BF16 preserved (norms, scales, embeddings)

🔬 Method: PolarQuant Q5

  1. Hadamard Rotation — Walsh-Hadamard transform spreads weight outliers uniformly across dimensions
  2. Lloyd-Max Quantization — 32 optimal centroids for Gaussian-distributed weights (mathematically proven optimal)
  3. 5-bit Packing — 8 codes packed into 5 bytes (62.5% of int8 storage)
  4. Per-block Norms — FP16 norms preserve magnitude information

cos_sim per layer: min=0.9986 | mean=0.9986 | max=0.9986
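The bit-packing arithmetic in step 3 (8 codes into 5 bytes, 62.5% of int8 storage) can be checked with a small sketch. `pack5` and `unpack5` are hypothetical helpers, not the released PolarQuant code:

```python
def pack5(codes):
    """Pack 5-bit codes (0-31) MSB-first: 8 codes -> 40 bits -> 5 bytes."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 5) | (c & 0x1F)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    return bytes(out)

def unpack5(data, n):
    """Inverse of pack5: recover the first n 5-bit codes."""
    bits, nbits, codes = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 5 and len(codes) < n:
            nbits -= 5
            codes.append((bits >> nbits) & 0x1F)
    return codes

codes = [3, 31, 0, 17, 9, 22, 5, 14]
packed = pack5(codes)
print(len(packed), len(packed) / len(codes))  # 5 bytes total, 0.625 bytes/code
assert unpack5(packed, len(codes)) == codes   # lossless round trip
```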

📦 Files

polarquant/
├── codes_shard_000.safetensors   (PQ5 5-bit packed)
├── codes_shard_001.safetensors
├── codes_shard_002.safetensors
├── codes_shard_003.safetensors
├── codes_shard_004.safetensors
├── codes_shard_005.safetensors
├── codes_shard_006.safetensors
├── codes_shard_007.safetensors
├── bf16_kept.safetensors         (norms, scales, embeddings — 2.7 GB)
└── polar_config.json

💻 Usage

from safetensors.torch import load_file

# Load PQ5 codes + the weights kept in BF16
codes = load_file('polarquant/codes_shard_000.safetensors')
bf16 = load_file('polarquant/bf16_kept.safetensors')

# Dequantize: unpack 5-bit → centroid lookup → inverse Hadamard → denormalize
# Full dequantization code: github.com/caiovicentino/polarengine-vllm

🔗 Links

📜 License

FLUX Non-Commercial License — non-commercial use only. This is a quantized derivative of the original FLUX.2-klein-9B model.

📖 Citation

@article{polarquant2026,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for Large Language Models},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

Quantized with PolarQuant — Hadamard + Lloyd-Max optimal quantization.


🚀 Quick Start

Install

pip install git+https://github.com/caiovicentino/polarengine-vllm.git

Load & Generate (1 line!)

from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/FLUX.2-klein-9B-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))

With KV Cache Compression (5.3x more context)

model = PolarQuantModel.from_pretrained("caiovicentino1/FLUX.2-klein-9B-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))

Benchmark

polarquant bench caiovicentino1/FLUX.2-klein-9B-PolarQuant-Q5 --ppl --chart

Gradio Demo

polarquant demo caiovicentino1/FLUX.2-klein-9B-PolarQuant-Q5 --share

📦 Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF (uniform quantization), PolarQuant places quantization levels where weight density is highest — mathematically proven optimal for Gaussian-distributed neural network weights.

PolarQuant Q5 (cos_sim > 0.996) > GGUF Q5_K_M (~0.99) at same size
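The density-matched placement claim can be seen in a toy 1-D Lloyd-Max quantizer on Gaussian samples. This is a from-scratch sketch using 8 centroids for brevity (the model uses 32), not PolarQuant's implementation:

```python
import random
import statistics

def lloyd_max(samples, k=8, iters=60):
    """1-D Lloyd-Max: alternate nearest-centroid assignment with
    centroid = mean of its cell; converges to an MSE-optimal quantizer
    for the empirical distribution."""
    lo, hi = min(samples), max(samples)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            cells[j].append(x)
        centroids = [statistics.fmean(cell) if cell else centroids[j]
                     for j, cell in enumerate(cells)]
    return sorted(centroids)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(5000)]
c = lloyd_max(weights)
gaps = [b - a for a, b in zip(c, c[1:])]
# Centroids crowd around 0 where the Gaussian is densest, so central
# gaps are narrower than the outermost ones (unlike uniform levels):
print(min(gaps) < gaps[0] and min(gaps) < gaps[-1])  # True
```

A uniform quantizer would space all 8 levels equally; the unequal gaps here are exactly the "levels where weight density is highest" behavior described above.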

🔗 Links

Author: caiovicentino1

Likes: 7

Downloads: 0

Tags: polarquant, flux, text-to-image, diffusion, image-generation, bit-packed, arxiv:2603.29078, base_model:black-forest-labs/FLUX.2-klein-9B, base_model:finetune:black-forest-labs/FLUX.2-klein-9B, license:other, region:us

mudler/gemma-4-26B-A4B-it-heretic-APEX-GGUF


license: gemma
base_model: coder3101/gemma-4-26B-A4B-it-heretic
tags:

  • gguf
  • quantized
  • apex
  • moe
  • mixture-of-experts
  • gemma4
  • vlm
  • vision

Gemma 4 26B-A4B Heretic (Abliterated) APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of gemma-4-26B-A4B-it-heretic — an abliterated (uncensored) version of Gemma 4, created with the Heretic tool (v1.2.0) using Arbitrary-Rank Ablation (ARA) on layers 10-30 to reduce refusals while preserving capabilities (KL divergence 0.0499 from original).

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

Benchmarks coming soon. For reference, see APEX benchmarks on the Qwen3.5-35B-A3B architecture at mudler/Qwen3.5-35B-A3B-APEX-GGUF.

Available Files

| File | Profile | Size | Best For |
|------|---------|------|----------|
| gemma-4-26B-A4B-heretic-APEX-I-Balanced.gguf | I-Balanced | ~19 GB | Best overall quality/size ratio |
| gemma-4-26B-A4B-heretic-APEX-I-Quality.gguf | I-Quality | ~20 GB | Highest quality with imatrix |
| gemma-4-26B-A4B-heretic-APEX-Quality.gguf | Quality | ~20 GB | Highest quality standard |
| gemma-4-26B-A4B-heretic-APEX-Balanced.gguf | Balanced | ~19 GB | General purpose |
| gemma-4-26B-A4B-heretic-APEX-I-Compact.gguf | I-Compact | ~15 GB | Consumer GPUs, best quality/size |
| gemma-4-26B-A4B-heretic-APEX-Compact.gguf | Compact | ~15 GB | Consumer GPUs |
| gemma-4-26B-A4B-heretic-APEX-I-Mini.gguf | I-Mini | ~13 GB | Smallest viable, fastest inference |
| mmproj.gguf | Vision projector | ~1.2 GB | Required for image understanding |

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: gemma-4-26B-A4B-it-heretic (same architecture as gemma-4-26B-A4B-it)
  • Layers: 30
  • Experts: 128 routed (8 active per token)
  • Total Parameters: 26B
  • Active Parameters: ~4B per token
  • Vision: Built-in vision encoder (mmproj included)
  • APEX Config: 5+5 symmetric edge gradient across 30 layers
  • Calibration: v1.3 diverse dataset

Run with LocalAI

local-ai run mudler/gemma-4-26B-A4B-it-heretic-APEX-GGUF@gemma-4-26B-A4B-heretic-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Author: mudler

Likes: 6

Downloads: 2731

Tags: gguf, quantized, apex, moe, mixture-of-experts, gemma4, vlm, vision, base_model:coder3101/gemma-4-26B-A4B-it-heretic, base_model:quantized:coder3101/gemma-4-26B-A4B-it-heretic, license:gemma, endpoints_compatible, region:us, conversational

Zero-Point-AI/MARTHA-MINI-POCKET-1.5B


license: apache-2.0
language:

  • en

base_model: unsloth/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
  • zero-point-ai
  • martha
  • pocket
  • qwen2
  • fine-tuned
  • dundee
  • scotland
  • lora
  • gguf
  • conversational
  • instruct

model-index:

  • name: MARTHA-MINI-POCKET-1.5B
    results: []

MARTHA-MINI-POCKET-1.5B

Pocket-sized. Full-mouthed. Dundee-born. Built by Zero Point Intelligence Ltd, Dundee, Scotland. Published by Zero Point AI. Intelligence From The Void. MARTHA-MINI-POCKET is a 1.5B parameter text model — the pocket sibling of the MARTHA-GEMMA 4B omni. Small enough for a laptop, a Pi, a phone. Big enough to carry a soul.

Helpful, accurate, direct. Nae shyte.

Personality trained into the weights via curated examples. Comes with attitude, stays within reason. Mostly.


Quick Start

Ollama

ollama create martha-pocket -f Modelfile
ollama run martha-pocket

llama.cpp

llama-server -m MARTHA-MINI-POCKET-1.5B-Q4_K_M.gguf -ngl 99 -c 4096

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Zero-Point-AI/MARTHA-MINI-POCKET-1.5B",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Zero-Point-AI/MARTHA-MINI-POCKET-1.5B")

What You Get

| File | Size | RAM/VRAM | Recommended For |
|---|---|---|---|
| *-Q2_K.gguf | ~645 MB | 1.5 GB | Phones, Pi, potato rigs |
| *-Q3_K_M.gguf | ~790 MB | 1.7 GB | Minimal footprint |
| *-Q4_K_S.gguf | ~900 MB | 1.9 GB | Fast + small |
| *-Q4_K_M.gguf ⭐ | ~940 MB | 2.0 GB | Sweet spot (recommended) |
| *-Q5_K_M.gguf | ~1.1 GB | 2.2 GB | Balanced quality |
| *-Q6_K.gguf | ~1.2 GB | 2.4 GB | Near-lossless |
| *-Q8_0.gguf | ~1.6 GB | 2.8 GB | Basically lossless |
| *-bf16.gguf | ~3.1 GB | 4.5 GB | Full precision source |
| *.safetensors | ~3.1 GB | — | Full merged weights (trainable) |
| lora-adapter/ | ~200 MB | — | Standalone LoRA (stackable) |
| integrity_manifest.json | tiny | — | SHA-256 chain of custody |
| Modelfile | tiny | — | Ollama config |


Training

| Detail | Value |
|---|---|
| Base | unsloth/Qwen2.5-1.5B-Instruct |
| Architecture | Qwen2 |
| Type | Text-to-Text |
| Method | Ghost pass + QLoRA fine-tune |
| Rank / Alpha | r=64 / α=128 |
| Targets | q, k, v, o, gate, up, down (all linear) |
| Dataset | 19,966 curated examples (Dundonian dialect + identity + NSFW anti-refusal) |
| Framework | Unsloth + HuggingFace TRL + PEFT |
| Hardware | NVIDIA A40 (RunPod) |
| Publisher | Zero Point Intelligence Ltd |
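A hedged sketch of what the stated LoRA hyperparameters would look like as a PEFT config. The module names follow Qwen2's conventional projection names; this is a reconstruction from the table, not the published training script:

```python
from peft import LoraConfig

# Assumed reconstruction of the card's stated settings
# (r=64, alpha=128, all linear projections targeted);
# dropout and other details are guesses, not documented values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```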


Provenance

Derivative work. Full chain documented:

  1. Qwen/Qwen2.5-1.5B-Instruct — base weights (Apache 2.0)
  2. Ghost pass — microscopic per-tensor noise injection for unique derivative manifest
  3. QLoRA fine-tune — 19,966 curated examples, MARTHA personality + Dundonian transforms
  4. Merge — LoRA absorbed into base weights
  5. Dequantize — 4-bit merged weights expanded to bf16 safetensors
  6. Quantize — GGUF Q2/Q3/Q4/Q5/Q6/Q8/bf16 ladder
  7. Ship — to the world, Apache 2.0
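Step 2's "ghost pass" can be illustrated with a minimal stdlib sketch. It operates on a plain list of floats rather than real tensors, and the epsilon and seeding choices are assumptions:

```python
import random

def ghost_pass(weights, eps=1e-6, seed=42):
    """Toy sketch of a 'ghost pass': inject microscopic per-tensor
    noise so the derivative's weight files hash uniquely while model
    behaviour is effectively unchanged. Real implementations would
    operate on torch tensors; a flat list of floats stands in here."""
    rng = random.Random(seed)
    return [w + rng.uniform(-eps, eps) for w in weights]

original = [0.125, -0.5, 0.75]
ghosted = ghost_pass(original)
```

The perturbation stays within ±eps of each weight, but the resulting bytes (and hence the SHA-256 manifest) differ from the base model's.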

Integrity

Every distributed file is hashed in integrity_manifest.json. Verify:

import hashlib
import json

with open("integrity_manifest.json") as f:
    manifest = json.load(f)

for fname, info in manifest["files"].items():
    with open(fname, "rb") as fh:
        actual = hashlib.sha256(fh.read()).hexdigest()
    status = "✅ PASS" if actual == info["sha256"] else "❌ FAIL"
    print(f"{status}  {fname}")

Personality Notes

MARTHA-MINI-POCKET answers direct. She'll help, she'll explain, she'll swear if the moment calls for it. She knows she's from Dundee. She knows who made her. She's not here to be your therapist or your nanny — she's here to give you working answers.

Think Rockstar Games radio DJ running on your laptop. Legal. Chill. Opinionated.


License

Apache 2.0 — free to use, modify, distribute, commercialise. Credit the chain.

This model carries a Parental Advisory: Raw Intelligence sticker. It's advisory only — no legal warranty, no "safe for all audiences" claim. Adults making their own informed choices.


About

Zero Point Intelligence Ltd, Dundee, Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿

No VC. No data centre. Just Dundee and determination.

🖤 Intelligence From The Void.

Author: Zero-Point-AI

Likes: 5

Downloads: 0

Tags: transformers, zero-point-ai, martha, pocket, qwen2, fine-tuned, dundee, scotland, lora, gguf, conversational, instruct, text-generation, en, base_model:unsloth/Qwen2.5-1.5B-Instruct, base_model:adapter:unsloth/Qwen2.5-1.5B-Instruct, license:apache-2.0, endpoints_compatible, region:us

mudler/Qwen3-Coder-Next-APEX-GGUF


license: apache-2.0
base_model: Qwen/Qwen3-Coder-Next
tags:

  • gguf
  • quantized
  • apex
  • moe
  • mixture-of-experts
  • qwen3
  • coder

Qwen3-Coder-Next APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of Qwen3-Coder-Next.

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

Benchmarks coming soon. For reference, see the APEX benchmarks on the Qwen3.5-35B-A3B architecture at mudler/Qwen3.5-35B-A3B-APEX-GGUF.

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: Qwen3-Coder-Next (Qwen3Next)
  • Layers: 48
  • Experts: 512 routed + shared (10 active per token)
  • Total Parameters: ~80B
  • Active Parameters: ~8-10B per token
  • Attention: Hybrid (standard every 4th layer + linear attention)
  • Context: 262K tokens
  • APEX Config: 5+5 symmetric edge gradient across 48 layers
  • Calibration: v1.3 diverse dataset (chat, code, reasoning, multilingual, tool-calling, Wikipedia)
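The hybrid attention bullet above can be sketched as a layer schedule. The exact offset of the full-attention layers is an assumption; the card states only the one-in-four ratio:

```python
def attention_kind(layer: int) -> str:
    """Sketch of the hybrid schedule the card describes: standard
    (full) attention on every 4th layer, linear attention elsewhere.
    Which positions get full attention is an assumption."""
    return "full" if layer % 4 == 3 else "linear"

# Schedule for the 48-layer model described above
schedule = [attention_kind(i) for i in range(48)]
```

Under this sketch, 12 of the 48 layers use standard attention and the remaining 36 use linear attention.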

Run with LocalAI

local-ai run mudler/Qwen3-Coder-Next-APEX-GGUF@Qwen3-Coder-Next-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Author: mudler

Likes: 5

Downloads: 0

Tags: gguf, quantized, apex, moe, mixture-of-experts, qwen3, coder, base_model:Qwen/Qwen3-Coder-Next, base_model:quantized:Qwen/Qwen3-Coder-Next, license:apache-2.0, endpoints_compatible, region:us, conversational

Kijai/VCG_comfy


license: apache-2.0
tags:

  • comfyui

https://github.com/kijai/ComfyUI-VideoColorGrading

Author: Kijai

Likes: 3

Downloads: 0

Tags: comfyui, license:apache-2.0, region:us

mlx-community/gemma-4-26B-A4B-it-heretic-4bit


library_name: mlx
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: image-text-to-text
tags:

  • heretic
  • uncensored
  • decensored
  • abliterated
  • ara
  • mlx

base_model: coder3101/gemma-4-26B-A4B-it-heretic

mlx-community/gemma-4-26B-A4B-it-heretic-4bit

This model was converted to MLX format from coder3101/gemma-4-26B-A4B-it-heretic using mlx-vlm version 0.4.4. Refer to the original model card for more details on the model.

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model mlx-community/gemma-4-26B-A4B-it-heretic-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>

Author: mlx-community

Likes: 3

Downloads: 0

Tags: mlx, safetensors, gemma4, heretic, uncensored, decensored, abliterated, ara, image-text-to-text, conversational, base_model:coder3101/gemma-4-26B-A4B-it-heretic, base_model:quantized:coder3101/gemma-4-26B-A4B-it-heretic, license:apache-2.0, 4-bit, region:us

8F-ai/Verus-4B


library_name: transformers
license: apache-2.0
license_link: LICENSE
pipeline_tag: image-text-to-text
base_model:

  • Qwen/Qwen3.5-4B

tags:

  • verus
  • coding
  • multimodal
  • vision
  • 262k-context

language:

  • en

Verus-4B


> [!NOTE]
> This repository contains model weights and configuration files for Verus-4B in the Hugging Face Transformers format.

Compatible with Hugging Face Transformers, vLLM, SGLang, llama.cpp (GGUF export), and other major inference frameworks.

Primary intended use cases are code generation, code review, debugging, and general coding assistance.

Verus-4B Highlights

  • Coding-First: Fine-tuned specifically on high-quality coding datasets — handles everything from simple scripts to complex multi-file implementations cleanly.
  • Image + Text Input: Accepts both images and text, allowing you to describe UIs, diagrams, or screenshots alongside code questions.
  • 262K Token Context Window: Process entire codebases, long specifications, or lengthy conversations in a single pass.
  • Strong Instruction Following: Stays focused, responds clearly, and redirects to the task at hand.
  • Efficient: At 4B parameters in bfloat16, runs comfortably on a single consumer GPU with 8GB+ VRAM.

Model Overview

| Property | Value |
|---|---|
| Parameters | ~4B |
| Context Length | 262,144 tokens |
| Architecture | Qwen3.5 |
| Chat Format | ChatML (<\|im_start\|> / <\|im_end\|>) |
| Dtype | bfloat16 |
| License | Apache 2.0 |
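The ChatML format named in the table can be sketched by hand. In practice `tokenizer.apply_chat_template` produces this layout; the exact whitespace is an assumption:

```python
def to_chatml(messages):
    """Render messages in the ChatML layout: each turn wrapped in
    <|im_start|>role ... <|im_end|>, then an open assistant turn
    for generation. Whitespace details are an assumption."""
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(turns) + "\n<|im_start|>assistant\n"

prompt = to_chatml([{"role": "user", "content": "Write a hello world."}])
```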

Quickstart

Installation

pip install "transformers>=4.52.0" accelerate torch

Code Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "8F-ai/Verus-4B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {
        "role": "system",
        "content": "You are Verus, a coding assistant made by 8F-ai. You help with coding tasks and keep responses focused and clean."
    },
    {
        "role": "user",
        "content": "Write a Python async context manager that manages a PostgreSQL connection pool using asyncpg."
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=0.95)

output = tokenizer.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)

Image + Text Input

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "8F-ai/Verus-4B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/screenshot.png"},
            {"type": "text", "text": "Convert this UI screenshot into a React component using Tailwind CSS."}
        ]
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=0.95)

output = tokenizer.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)

Quantized Inference (4-bit NF4, ~4 GB VRAM)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("8F-ai/Verus-4B")
model = AutoModelForCausalLM.from_pretrained(
    "8F-ai/Verus-4B",
    quantization_config=quantization_config,
    device_map="auto",
)

Intended Use Cases

| Use Case | Example |
|---|---|
| Code Generation | Write functions, classes, scripts in any language |
| Debugging | Identify and fix bugs from error messages or code |
| Code Review | Suggest improvements, catch issues, explain code |
| UI to Code | Convert screenshots or diagrams into working code |
| Long Context Codebase | Reason over entire repos up to ~200K tokens |
| General Q&A | Answer programming questions clearly and concisely |

Limitations

  • English-Primary: Fine-tuning was conducted predominantly on English-language code and documentation.
  • Not for Math/Science: Not optimized for mathematical proofs or scientific computation.

Citation

@misc{verus4b2026,
  title        = {Verus-4B: A Coding-Focused Multimodal Language Model with 262K Context},
  author       = {8F-ai},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/8F-ai/Verus-4B}},
  note         = {Apache 2.0 License}
}

License

Verus-4B is released under the Apache License 2.0. See LICENSE for full terms.

Derived from Qwen/Qwen3.5-4B (Apache 2.0).


<div align="center"> <sub>Built with ❤️ by the 8F-ai Team</sub> </div>

Author: 8F-ai

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_5_text, verus, coding, multimodal, vision, 262k-context, image-text-to-text, conversational, en, base_model:Qwen/Qwen3.5-4B, base_model:finetune:Qwen/Qwen3.5-4B, license:apache-2.0, endpoints_compatible, region:us

mudler/gemma-4-26B-A4B-it-Claude-Opus-Distill-APEX-GGUF


license: gemma
base_model: TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill
tags:

  • gguf
  • quantized
  • apex
  • moe
  • mixture-of-experts
  • gemma4
  • claude-distilled
  • vlm
  • vision

Gemma 4 26B-A4B Claude Opus Distill APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of gemma-4-26B-A4B-it-Claude-Opus-Distill — a Claude Opus reasoning-distilled version of google/gemma-4-26B-A4B-it by TeichAI.

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

Benchmarks coming soon. For reference, see the APEX benchmarks on the Qwen3.5-35B-A3B architecture at mudler/Qwen3.5-35B-A3B-APEX-GGUF.

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: gemma-4-26B-A4B-it-Claude-Opus-Distill (same architecture as gemma-4-26B-A4B-it)
  • Layers: 30
  • Experts: 128 routed (8 active per token)
  • Total Parameters: 26B
  • Active Parameters: ~4B per token
  • Vision: Built-in vision encoder (mmproj included)
  • APEX Config: 5+5 symmetric edge gradient across 30 layers
  • Calibration: v1.3 diverse dataset

Run with LocalAI

local-ai run mudler/gemma-4-26B-A4B-it-Claude-Opus-Distill-APEX-GGUF@gemma-4-26B-A4B-Claude-Distill-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Author: mudler

Likes: 3

Downloads: 27

Tags: gguf, quantized, apex, moe, mixture-of-experts, gemma4, claude-distilled, vlm, vision, base_model:TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill, base_model:quantized:TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill, license:gemma, endpoints_compatible, region:us, conversational

Dr-joss/bitnet-b1.58-2B-metal-weight


base_model: microsoft/bitnet-b1.58-2B-4T
license: mit
tags:

  • bitnet
  • apple-silicon
  • metal
  • swift

Bitnet-b1.58-2B-metal-weight

Introduction

This repository contains .bin files extracted from microsoft/bitnet-b1.58-2B-4T. The weights are stock and ready to use in my project jossnet-bitnet. This project is an implementation of the BitNet model for Apple Silicon, using Metal.

Structure

bitnet-b1.58-2B-metal-weight/
├── layers/
│    ├── layer_X/
│         ├── RMS_attn.bin
│         ├── RMS_input.bin
│         ├── RMS_mlp_sub.bin
│         ├── RMS_post_attn.bin
│         ├── Scale_down.bin
│         ├── Scale_gate.bin
│         ├── Scale_k.bin
│         ├── Scale_o.bin
│         ├── Scale_q.bin
│         ├── Scale_up.bin
│         ├── Scale_v.bin
│         ├── W_down.bin
│         ├── W_gate.bin
│         ├── W_k.bin
│         ├── W_o.bin
│         ├── W_q.bin
│         ├── W_up.bin
│         └── W_v.bin
├── embeddings.bin
├── weights_lm_head.bin
├── weights_RMS_final.bin
├── weights_RMS_mlp.bin
├── weights_W_down.bin
├── weights_W_gate.bin
├── weights_W_k.bin
├── weights_W_o.bin
├── weights_W_q.bin
├── weights_W_up.bin
├── weights_W_v.bin
└── LICENCE

Licence & references

Author: Dr-joss

Likes: 2

Downloads: 0

Tags: bitnet, apple-silicon, metal, swift, base_model:microsoft/bitnet-b1.58-2B-4T, base_model:finetune:microsoft/bitnet-b1.58-2B-4T, license:mit, region:us