Today's AI Summary

AI Developments: Enhanced Image Generation, Safer Robotics, and Model Interpretability

Today's AI landscape showcases advancements across various domains, including image generation, robotics, and model understanding. New models and research papers are pushing the boundaries of what's possible, with a focus on improving performance, safety, and transparency.

Research Highlights

Several interesting research papers have emerged:

  • SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation: This paper introduces a framework called SafeBimanual, which enhances the safety of diffusion-based bimanual manipulation policies. It uses a vision-language model (VLM) to schedule cost functions, ensuring safety constraints are dynamically generated during the manipulation process. Experiments show improved success rates and reduced unsafe interactions in both simulated and real-world tasks.
  • Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data: This paper presents Type-Compliant Adaptation Cascades (TACs), a framework that adapts LLM workflows by learning typed probabilistic programs. TACs treats the entire workflow as an unnormalized joint distribution, enabling gradient-based training. The framework demonstrates significant performance improvements on structured tasks compared to state-of-the-art prompt-optimization baselines.
  • Disentangling the Factors of Convergence between Brains and Computer Vision Models: This research explores the factors that drive brain-model similarity in AI models trained on natural images. By systematically varying model size, training amount, and image type, the study reveals that all three factors independently and interactively impact brain similarity metrics. The largest models trained with human-centric images achieve the highest brain-similarity.
  • ST-Raptor: LLM-Powered Semi-Structured Table Question Answering: This paper introduces ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. It uses a Hierarchical Orthogonal Tree (HO-Tree) to capture complex table layouts and incorporates a two-stage verification mechanism for accurate pipeline execution. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy on a new dataset of real-world semi-structured tables.
  • Explain and Monitor Deep Learning Models for Computer Vision using Obz AI: This paper introduces Obz AI, a software ecosystem designed to facilitate explainability and observability for vision AI systems. Obz AI provides a seamless integration pipeline, from a Python client library to a full-stack analytics dashboard, allowing machine learning engineers to incorporate advanced XAI methodologies and monitor AI models in real time.

Model Updates

Several new models have been released, focusing on image generation and multimodal capabilities:

  • Shakker-Labs/AWPortrait-QW: This LoRA model, based on Qwen-Image, is trained to generate portraits with enhanced Chinese facial features and aesthetics. It covers a wide range of genres and delivers more detailed and realistic skin texture compared to the original Qwen-Image.
  • OpenBMB/MiniCPM-V-4_5-AWQ: MiniCPM-V 4.5 is a multimodal model built on Qwen3-8B and SigLIP2-400M, featuring state-of-the-art vision-language capabilities. It achieves high performance on benchmarks, supports efficient video understanding, and excels in OCR and document parsing. The model is designed for easy usage on various platforms, including local devices and iOS apps.
  • DavidAU/Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder: This Mixture of Experts (MOE) model combines four 0.6B models in a Mixtral-type Qwen 3 structure. It is designed for high-speed creative tasks and general use, with a native context of 40k. The model allows users to adjust the number of active experts for different tasks, such as coding and general writing.

Key Takeaways

  • Safety in Robotics: Research is increasingly focusing on ensuring the safety of AI-driven robotic systems, particularly in complex manipulation tasks.
  • Structured Learning: Frameworks like TACs are enabling more reliable and task-compliant LLM systems by learning typed probabilistic programs.
  • Model Interpretability: Tools like Obz AI are making deep learning models more transparent and understandable, promoting responsible deployment in computer vision applications.
  • Multimodal Advancements: Models like MiniCPM-V 4.5 are pushing the boundaries of multimodal AI, offering improved performance in vision-language tasks, video understanding, and document processing.
  • Efficient Image Generation: LoRA models like AWPortrait-QW are enabling more specialized and high-quality image generation with enhanced control over specific features and aesthetics.

AI Papers for 2026-04-09

In-Place Test-Time Training

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency, and misaligned fast-weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the next-token-prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
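
A rough PyTorch sketch may help make the core idea concrete. It is a drastic simplification: the "mlp.down_proj" parameter-name pattern, plain SGD, and the ordinary next-token loss are assumptions standing in for the paper's tailored objective and update rule.

import torch

def in_place_ttt_chunk(model, chunk_ids, lr=1e-4):
    # Treat only the final projection matrix of each MLP block as fast weights.
    for name, p in model.named_parameters():
        p.requires_grad_("mlp.down_proj" in name)  # assumed name pattern
    fast = [p for p in model.parameters() if p.requires_grad]
    # A next-token-prediction loss on the incoming chunk drives the fast-weight update.
    loss = model(input_ids=chunk_ids, labels=chunk_ids).loss
    loss.backward()
    with torch.no_grad():
        for p in fast:
            p -= lr * p.grad
            p.grad = None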

DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.
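
For readers unfamiliar with MTP, a minimal sketch of the vanilla multi-token prediction loss (not LSE-MTP itself; the per-offset heads and uniform averaging are assumptions) looks like this:

import torch
import torch.nn.functional as F

def mtp_loss(hidden, heads, targets):
    # hidden: (B, T, d) final hidden states; heads: list of k output projections;
    # targets: (B, T) token ids. Head k predicts the token k steps ahead.
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-k])  # position t predicts token t+k
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets[:, k:].reshape(-1)
        )
    return loss / len(heads)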

Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

The governance of artificial intelligence has a blind spot: the machine identities that AI systems use to act. AI agents, service accounts, API tokens, and automated workflows now outnumber human identities in enterprise environments by ratios exceeding 80 to 1, yet no integrated framework exists to govern them. A single ungoverned automated agent produced $5.4-10 billion in losses in the 2024 CrowdStrike outage; nation-state actors including Silk Typhoon and Salt Typhoon have operationalized ungoverned machine credentials as primary espionage vectors against critical infrastructure. This paper makes four original contributions. First, the AI-Identity Risk Taxonomy (AIRT): a comprehensive enumeration of 37 risk sub-categories across eight domains, each grounded in documented incidents, regulatory recognition, practitioner prevalence data, and threat intelligence. Second, the Machine Identity Governance Taxonomy (MIGT): an integrated six-domain governance framework simultaneously addressing the technical governance gap, the regulatory compliance gap, and the cross-jurisdictional coordination gap that existing frameworks address only in isolation. Third, a foreign state actor threat model for enterprise identity governance, establishing that Silk Typhoon, Salt Typhoon, Volt Typhoon, and North Korean AI-enhanced identity fraud operations have already operationalized AI identity vulnerabilities as active attack vectors. Fourth, a cross-jurisdictional regulatory alignment structure mapping enterprise AI identity governance obligations under EU, US, and Chinese frameworks simultaneously, identifying irreconcilable conflicts and providing a governance mechanism for managing them. A four-phase implementation roadmap translates the MIGT into actionable enterprise programs.

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based reference SOAP note production. It is built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.

Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

Efficient data loading remains a bottleneck for near-term quantum machine-learning. Existing schemes (angle, amplitude, and basis encoding) either underuse the exponential Hilbert-space capacity or require circuit depths that exceed the coherence budgets of noisy intermediate-scale quantum hardware. We introduce Shot-Based Quantum Encoding (SBQE), a data embedding strategy that distributes the hardware's native resource, shots, according to a data-dependent classical distribution over multiple initial quantum states. By treating the shot counts as a learnable degree of freedom, SBQE produces a mixed-state representation whose expectation values are linear in the classical probabilities and can therefore be composed with non-linear activation functions. We show that SBQE is structurally equivalent to a multilayer perceptron whose weights are realised by quantum circuits, and we describe a hardware-compatible implementation protocol. Benchmarks on Fashion MNIST and Semeion handwritten digits, with ten independent initialisations per model, show that SBQE achieves 89.1% +/- 0.9% test accuracy on Semeion (reducing error by 5.3% relative to amplitude encoding and matching a width-matched classical network) and 80.95% +/- 0.10% on Fashion MNIST (exceeding amplitude encoding by +2.0% and a linear multilayer perceptron by +1.3%), all without any data-encoding gates.
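
The linearity claim is easy to illustrate numerically. A toy sketch (all numbers invented; real SBQE prepares and measures quantum states rather than using fixed expectations):

import numpy as np

x = np.array([0.2, 0.5, 0.3])               # data-dependent distribution over 3 initial states
shots = np.round(x * 10_000).astype(int)    # shot budget allocated per state
exp_per_state = np.array([0.9, -0.1, 0.4])  # per-state observable expectations
# The mixed-state expectation is the probability-weighted sum, i.e. linear in x,
# so a classical non-linear activation can be composed on top, as in an MLP layer.
expectation = np.dot(shots / shots.sum(), exp_per_state)
print(expectation)  # ~0.25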

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing poorer on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
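
The distinction between Pass@k and Pass^k is worth spelling out. A toy illustration (assumed definitions: Pass@k counts success in at least one of k trials, Pass^k in all k):

trials = [True, False, True]   # one task, three independent runs
pass_at_3 = any(trials)        # True: the agent *can* solve the task
pass_hat_3 = all(trials)       # False: it does not solve the task *reliably*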

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
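
A hypothetical simplification of such a mixer (not the authors' PoM; see their repository for the real implementation) shows why the cost is linear in sequence length:

import torch
import torch.nn as nn

class PolyMixerSketch(nn.Module):
    def __init__(self, d: int, degree: int = 2):
        super().__init__()
        self.degree = degree
        self.summarize = nn.Linear(d * degree, d)  # compress pooled polynomial features
        self.retrieve = nn.Linear(2 * d, d)        # token-conditioned readout

    def forward(self, x):                            # x: (B, N, d)
        feats = torch.cat([x ** k for k in range(1, self.degree + 1)], dim=-1)
        summary = self.summarize(feats.mean(dim=1))  # one pass over tokens: O(N)
        ctx = summary.unsqueeze(1).expand_as(x)      # every token reads the summary
        return x + self.retrieve(torch.cat([x, ctx], dim=-1))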

Gym-Anything: Turn any Software into an Agent Environment

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

AI Models

LiquidAI/LFM2.5-VL-450M


library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
  - en
  - ja
  - ko
  - fr
  - es
  - de
  - ar
  - zh
  - pt
pipeline_tag: image-text-to-text
tags:
  - liquid
  - lfm2
  - lfm2-vl
  - edge
  - lfm2.5-vl
  - lfm2.5
base_model: LiquidAI/LFM2.5-350M

Try LFM: https://playground.liquid.ai/chat?model=lfm2.5-vl-450m • Docs: https://docs.liquid.ai/lfm/getting-started/welcome • LEAP: https://leap.liquid.ai/ • Discord: https://discord.com/invite/liquid-ai

LFM2.5-VL-450M

LFM2.5-VL-450M is Liquid AI's refreshed version of its first vision-language model, LFM2-VL-450M, built on the updated LFM2.5-350M backbone and tuned for stronger real-world performance. Find out more about the LFM2.5 family of models in our blog post.

  • Enhanced instruction following on vision and language tasks.
  • Improved multilingual vision understanding in Arabic, Chinese, French, German, Japanese, Korean, Portuguese and Spanish.
  • Bounding box prediction and object detection for grounded visual understanding.
  • Function calling support for text-only input.

🎥⚡️ You can try LFM2.5-VL-450M running locally in your browser with our real-time video stream captioning WebGPU demo 🎥⚡️

Alternatively, try the API model on the Playground.

📄 Model details

LFM2.5-VL-450M is a general-purpose vision-language model with the following features:

  • LM Backbone: LFM2.5-350M
  • Vision encoder: SigLIP2 NaFlex shape‑optimized 86M
  • Context length: 32,768 tokens
  • Vocabulary size: 65,536
  • Languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish
  • Native resolution processing: handles images up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion
  • Tiling strategy: splits large images into non-overlapping 512×512 patches and includes thumbnail encoding for global context
  • Inference-time flexibility: user-tunable maximum image tokens and tile count for speed/quality tradeoff without retraining
  • Generation parameters:
    • text: temperature=0.1, min_p=0.15, repetition_penalty=1.05
    • vision: min_image_tokens=32, max_image_tokens=256, do_image_splitting=True

| Model | Description |
|-------|-------------|
| LFM2.5-VL-450M | Original model checkpoint in native format. Best for fine-tuning or inference with Transformers and vLLM. |
| LFM2.5-VL-450M-GGUF | Quantized format for llama.cpp and compatible tools. Optimized for CPU inference and local deployment with reduced memory usage. |
| LFM2.5-VL-450M-ONNX | ONNX Runtime format for cross-platform deployment. Enables hardware-accelerated inference across diverse environments (cloud, edge, mobile). |
| LFM2.5-VL-450M-MLX-8bit | MLX format for Apple Silicon. Optimized for fast on-device inference on Mac with mlx-vlm. Also available in 4bit, 5bit, 6bit, and bf16. |

We recommend using it for general vision-language workloads, captioning and object detection. It’s not well-suited for knowledge-intensive tasks or fine-grained OCR.

Chat Template

LFM2.5-VL uses a ChatML-like format. See the Chat Template documentation for details.

<|startoftext|><|im_start|>system
You are a helpful multimodal assistant by Liquid AI.<|im_end|>
<|im_start|>user
<image>Describe this image.<|im_end|>
<|im_start|>assistant
This image shows a Caenorhabditis elegans (C. elegans) nematode.<|im_end|>

You can use processor.apply_chat_template() to format your messages automatically.

🏃 Inference

You can run LFM2.5-VL-450M with Hugging Face transformers v5.1 or newer:

pip install transformers pillow
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "LiquidAI/LFM2.5-VL-450M"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image and create conversation
url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# Generate Answer
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
processor.batch_decode(outputs, skip_special_tokens=True)[0]

# This image captures the iconic Statue of Liberty standing majestically on Liberty Island in New York City. The statue, a symbol of freedom and democracy, is prominently featured in the foreground, its greenish-gray hue contrasting beautifully with the surrounding water.
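
To reproduce the recommended generation parameters from the model details above, the generate() call can be extended as follows (a sketch; do_sample=True is needed for temperature and min_p to take effect):

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=64,
)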

Visual grounding

LFM2.5-VL-450M supports bounding box prediction:

url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)
query = "statue"
prompt = f'Detect all instances of: {query}. Response must be a JSON array: [{{"label": ..., "bbox": [x1, y1, x2, y2]}}, ...]. Coordinates are normalized to [0,1].'

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    },
]

# Generate Answer
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
processor.batch_decode(outputs, skip_special_tokens=True)[0]

# [{"label": "statue", "bbox": [0.3, 0.25, 0.4, 0.65]}]
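
Since the coordinates are normalized, a small post-processing step converts them to pixel boxes. A sketch (assumes the response contains a valid JSON array, as above):

import json

generated = outputs[0, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
response = processor.tokenizer.decode(generated, skip_special_tokens=True)
detections = json.loads(response[response.index("[") : response.rindex("]") + 1])
w, h = image.size
for det in detections:
    x1, y1, x2, y2 = det["bbox"]
    print(det["label"], (x1 * w, y1 * h, x2 * w, y2 * h))  # pixel coordinates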

Tool Use

LFM2.5 supports function calling for text-only input by applying the chat template with the tokenizer. See the Tool Use documentation for the full guide.

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# Apply chat template with tools
inputs = processor.tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
input_ids = inputs["input_ids"].to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
response = processor.tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=False)

# <|tool_call_start|>[get_weather(location="Paris")]<|tool_call_end|>I am retrieving the current weather for Paris.<|im_end|>
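
A sketch of the dispatch step that would follow (the delimiters come from the output above; get_weather is a stand-in implementation, and the "tool" role name is an assumption):

import re

def get_weather(location: str) -> str:
    return f"18°C and cloudy in {location}"  # stand-in implementation

# Extract the call between the tool-call delimiters and execute it.
match = re.search(r"<\|tool_call_start\|>\[(\w+)\((.*?)\)\]<\|tool_call_end\|>", response)
if match:
    name, raw_args = match.groups()
    kwargs = dict(re.findall(r'(\w+)="([^"]*)"', raw_args))  # parse keyword args
    result = {"get_weather": get_weather}[name](**kwargs)
    messages.append({"role": "tool", "content": result})  # feed back for the next turn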

| Name | Description | Docs | Notebook |
|------|-------------|------|----------|
| Transformers | Simple inference with direct access to model internals. | https://docs.liquid.ai/lfm/inference/transformers#vision-models | https://colab.research.google.com/drive/1WVQpf4XrHgHFkP0FnlZfx2nK8PugvQNZ?usp=sharing |
| vLLM | High-throughput production deployments with GPU. | https://docs.liquid.ai/deployment/gpu-inference/vllm#vision-models | https://colab.research.google.com/drive/1sUfQlqAvuAVB4bZ6akYVQPGmHtTDUNpF?usp=sharing |
| SGLang | High-throughput production deployments with GPU. | https://docs.liquid.ai/deployment/gpu-inference/sglang#vision-models | https://colab.research.google.com/drive/1qJlAFag223yFOZGzuMIkYUFhybM9ao5g?usp=sharing |
| llama.cpp | Cross-platform inference with CPU offloading. | https://docs.liquid.ai/lfm/inference/llama-cpp#vision-models | https://colab.research.google.com/drive/1q2PjE6O_AahakRlkTNJGYL32MsdUcj7b?usp=sharing |

🔧 Fine-tuning

We recommend fine-tuning the LFM2.5-VL-450M model on your use cases to maximize performance.

| Notebook | Description | Link |
|----------|-------------|------|
| SFT (Unsloth) | Supervised Fine-Tuning with LoRA using Unsloth. | https://colab.research.google.com/drive/1FaR2HSe91YDe88TG97-JVxMygl-rL6vB?usp=sharing |
| SFT (TRL) | Supervised Fine-Tuning with LoRA using TRL. | https://colab.research.google.com/drive/10530_jt_Joa5zH2wgYlyXosypq1R7PIz?usp=sharing |

📊 Performance

LFM2.5-VL-450M improves over LFM2-VL-450M across both vision and language benchmarks, while also adding two new capabilities: bounding box prediction on RefCOCO-M and function calling support measured by BFCLv4.

Vision benchmarks

| Model | MMStar | RealWorldQA | MMBench (dev en) | MMMU (val) | POPE | MMVet | BLINK | InfoVQA (val) | OCRBench | MM-IFEval | MMMB | CountBench | RefCOCO-M |
|-------|--------|-------------|------------------|------------|------|-------|-------|---------------|----------|-----------|------|------------|-----------|
| LFM2.5-VL-450M | 43.00 | 58.43 | 60.91 | 32.67 | 86.93 | 41.10 | 43.92 | 43.02 | 684 | 45.00 | 68.09 | 73.31 | 81.28 |
| LFM2-VL-450M | 40.87 | 52.03 | 56.27 | 34.44 | 83.79 | 33.85 | 42.61 | 44.56 | 657 | 33.09 | 54.29 | 47.64 | - |
| SmolVLM2-500M | 38.20 | 49.90 | 52.32 | 34.10 | 82.67 | 29.90 | 40.70 | 24.64 | 609 | 11.27 | 46.79 | 61.81 | - |

All vision benchmark scores are obtained using VLMEvalKit. Multilingual scores are based on the average of benchmarks translated by GPT-4.1-mini from English to Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.

Language benchmarks

| Model | GPQA | MMLU Pro | IFEval | Multi-IF | BFCLv4 |
|-------|------|----------|--------|----------|--------|
| LFM2.5-VL-450M | 25.66 | 19.32 | 61.16 | 34.63 | 21.08 |
| LFM2-VL-450M | 23.13 | 17.22 | 51.75 | 26.21 | - |
| SmolVLM2-500M | 23.84 | 13.57 | 30.14 | 6.82 | - |


Citation

@article{liquidai2025lfm2,
 title={LFM2 Technical Report},
 author={Liquid AI},
 journal={arXiv preprint arXiv:2511.23404},
 year={2025}
}

Author: LiquidAI

Likes: 50

Downloads: 0

Tags: transformers, safetensors, lfm2_vl, image-text-to-text, liquid, lfm2, lfm2-vl, edge, lfm2.5-vl, lfm2.5, conversational, en, ja, ko, fr, es, de, ar, zh, pt, arxiv:2511.23404, base_model:LiquidAI/LFM2.5-350M, base_model:finetune:LiquidAI/LFM2.5-350M, license:other, endpoints_compatible, region:us

LiquidAI/LFM2.5-VL-450M-ONNX


license: other
license_name: lfm1.0
license_link: LICENSE
language:
  - en
  - ja
  - ko
  - fr
  - es
  - de
  - it
  - pt
  - ar
  - zh
pipeline_tag: image-text-to-text
tags:
  - liquid
  - edge
  - lfm2.5-vl
  - lfm2.5
  - onnx
  - onnxruntime
  - webgpu
base_model:
  - LiquidAI/LFM2.5-VL-450M

Try LFM: https://playground.liquid.ai/chat?model=lfm2.5-vl-450m • Documentation: https://docs.liquid.ai/lfm • LEAP: https://leap.liquid.ai/ • Discord: https://discord.com/invite/liquid-ai

LFM2.5-VL-450M-ONNX

ONNX export of LFM2.5-VL-450M for cross-platform inference.

Recommended Variants

| Encoder | Decoder | Size | Platform | Use Case |
|---------|---------|------|----------|----------|
| FP16 | Q4 | ~770MB | WebGPU, Server | Recommended for most uses |
| FP16 | FP16 | ~1.0GB | Server | Higher quality |

  • WebGPU: Use FP16 encoder + Q4 decoder (Q8 not supported on WebGPU)
  • Server: FP16+Q4 for efficiency, FP16+FP16 for quality

Model Files

onnx/
├── embed_tokens.onnx               # Token embeddings (FP32, 256MB)
├── embed_tokens_fp16.onnx          # Token embeddings (FP16, 128MB)
├── embed_tokens_fp16.onnx_data
├── vision_encoder.onnx             # Vision encoder (FP32, 359MB)
├── vision_encoder.onnx_data
├── vision_encoder_fp16.onnx        # Vision encoder (FP16, 180MB)
├── vision_encoder_fp16.onnx_data
├── vision_encoder_q4.onnx          # Vision encoder (Q4, 57MB)
├── vision_encoder_q4.onnx_data
├── vision_encoder_q8.onnx          # Vision encoder (Q8, 105MB)
├── vision_encoder_q8.onnx_data
├── decoder_model_merged.onnx       # Language decoder (FP32, 1.4GB)
├── decoder_model_merged.onnx_data
├── decoder_model_merged_fp16.onnx  # Language decoder (FP16, 692MB)
├── decoder_model_merged_fp16.onnx_data
├── decoder_model_merged_q4.onnx    # Language decoder (Q4, 459MB)
├── decoder_model_merged_q4.onnx_data
├── decoder_model_merged_q8.onnx    # Language decoder (Q8, 604MB)
└── decoder_model_merged_q8.onnx_data

Python

Installation

pip install onnxruntime transformers pillow torch huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers pillow torch huggingface_hub

Inference

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor
from PIL import Image

# Download model files (fp16 encoder + q4 decoder recommended)
model_id = "LiquidAI/LFM2.5-VL-450M-ONNX"
embed_tokens_path = hf_hub_download(model_id, "onnx/embed_tokens_fp16.onnx")
vision_encoder_path = hf_hub_download(model_id, "onnx/vision_encoder_fp16.onnx")
decoder_path = hf_hub_download(model_id, "onnx/decoder_model_merged_q4.onnx")

# Download all data files
from huggingface_hub import list_repo_files
for f in list_repo_files(model_id):
    if any(f.startswith(f"onnx/{name}") for name in [
        "embed_tokens_fp16.onnx_data",
        "vision_encoder_fp16.onnx_data",
        "decoder_model_merged_q4.onnx_data"
    ]):
        hf_hub_download(model_id, f)

# Load ONNX sessions
embed_tokens = ort.InferenceSession(embed_tokens_path)
vision_encoder = ort.InferenceSession(vision_encoder_path)
decoder = ort.InferenceSession(decoder_path)

# Load processor
processor = AutoProcessor.from_pretrained("LiquidAI/LFM2.5-VL-450M", trust_remote_code=True)

# Prepare input
image = Image.open("photo.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is in this image?"}
]}]

# Process inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt")

# Convert to numpy with correct dtypes
pixel_values = inputs["pixel_values"].numpy().astype(np.float32)
pixel_attention_mask = inputs["pixel_attention_mask"].numpy().astype(np.int64)
spatial_shapes = inputs["spatial_shapes"].numpy().astype(np.int64)
input_ids = inputs["input_ids"].numpy().astype(np.int64)

# Get image embeddings
image_outputs = vision_encoder.run(None, {
    "pixel_values": pixel_values,
    "pixel_attention_mask": pixel_attention_mask,
    "spatial_shapes": spatial_shapes,
})
image_embeds = image_outputs[0]

# Get token embeddings
token_outputs = embed_tokens.run(None, {"input_ids": input_ids})
token_embeds = token_outputs[0]

# Replace <image> tokens with image embeddings
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
image_positions = np.where(input_ids[0] == image_token_id)[0]
for i, pos in enumerate(image_positions):
    if i < len(image_embeds):
        token_embeds[0, pos] = image_embeds[i]

# Initialize KV cache for stateful decoding
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
for inp in decoder.get_inputs():
    if inp.name in {"inputs_embeds", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))

# Generate tokens
seq_len = token_embeds.shape[1]
generated_tokens = []

for step in range(100):  # max tokens
    if step == 0:
        embeds = token_embeds.astype(np.float32)
    else:
        last_token = np.array([[generated_tokens[-1]]], dtype=np.int64)
        embeds = embed_tokens.run(None, {"input_ids": last_token})[0].astype(np.float32)

    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"inputs_embeds": embeds, "attention_mask": attn_mask, **cache}

    outputs = decoder.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated_tokens.append(next_token)

    # Update cache
    for i, out in enumerate(decoder.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == processor.tokenizer.eos_token_id:
        break

print(processor.tokenizer.decode(generated_tokens, skip_special_tokens=True))

WebGPU (Browser)

Installation

npm install onnxruntime-web @huggingface/transformers

Enable WebGPU

WebGPU is required for browser inference. To enable:

  1. Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
  2. Verify: Check chrome://gpu for "WebGPU" status
  3. Test: Run navigator.gpu.requestAdapter() in DevTools console

Inference

import * as ort from "onnxruntime-web/webgpu";
import { AutoTokenizer } from "@huggingface/transformers";

// Check WebGPU availability
if (!navigator.gpu) {
  throw new Error("WebGPU not available. Enable at chrome://flags/#enable-unsafe-webgpu");
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("WebGPU adapter not found. Check chrome://gpu for status.");
}

ort.env.wasm.numThreads = 1;

const modelId = "LiquidAI/LFM2.5-VL-450M-ONNX";
const modelBase = `https://huggingface.co/${modelId}/resolve/main`;

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);

// Load ONNX sessions with external data
async function loadSession(name) {
  const onnxPath = `${modelBase}/onnx/${name}.onnx`;
  const fileName = `${name}.onnx_data`;
  return ort.InferenceSession.create(onnxPath, {
    executionProviders: ["webgpu"],
    externalData: [{ path: fileName, data: `${modelBase}/onnx/${fileName}` }],
  });
}

const embedTokens = await loadSession("embed_tokens_fp16");
const visionEncoder = await loadSession("vision_encoder_fp16");
const decoder = await loadSession("decoder_model_merged_q4");

// Model config
const hiddenSize = 1024;
const numKVHeads = 8;
const headDim = 64;

// Get text embeddings helper
async function getTextEmbeddings(ids) {
  const tensor = new ort.Tensor("int64", new BigInt64Array(ids.map(BigInt)), [1, ids.length]);
  const out = await embedTokens.run({ input_ids: tensor });
  return out.inputs_embeds;
}

// Initialize KV cache
function initCache() {
  const cache = {};
  for (const name of decoder.inputNames) {
    if (name.startsWith("past_conv")) {
      cache[name] = new ort.Tensor("float32", new Float32Array(hiddenSize * 3), [1, hiddenSize, 3]);
    } else if (name.startsWith("past_key_values")) {
      cache[name] = new ort.Tensor("float32", new Float32Array(0), [1, numKVHeads, 0, headDim]);
    }
  }
  return cache;
}

// Update cache from outputs
function updateCache(cache, outputs) {
  for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith("present_conv")) {
      cache[name.replace("present_conv", "past_conv")] = tensor;
    } else if (name.startsWith("present.")) {
      cache[name.replace("present.", "past_key_values.")] = tensor;
    }
  }
}

// Build prompt and tokenize. Define the chat messages first (text-only sketch;
// for vision inputs, also merge image embeddings as noted below).
const messages = [{ role: "user", content: "What is in this image?" }];
const prompt = tokenizer.apply_chat_template(messages, { add_generation_prompt: true, tokenize: false });
const inputIds = tokenizer.encode(prompt);

// Get embeddings (for VL: merge image embeddings at <image> token positions)
let inputsEmbeds = await getTextEmbeddings(inputIds);

// Generation loop
const cache = initCache();
const eosTokenId = tokenizer.eos_token_id;
const generatedTokens = [];
let curLen = inputsEmbeds.dims[1];
let embeds = inputsEmbeds;

for (let step = 0; step < 256; step++) {
  const attentionMask = new ort.Tensor("int64", new BigInt64Array(curLen).fill(1n), [1, curLen]);

  const outputs = await decoder.run({ inputs_embeds: embeds, attention_mask: attentionMask, ...cache });

  // Greedy decode: argmax of last token logits
  const logits = outputs.logits;
  const vocabSize = logits.dims[2];
  const lastLogits = logits.data.slice((logits.dims[1] - 1) * vocabSize);
  const nextToken = lastLogits.indexOf(Math.max(...lastLogits));

  generatedTokens.push(nextToken);
  if (nextToken === eosTokenId) break;

  updateCache(cache, outputs);
  embeds = await getTextEmbeddings([nextToken]);
  curLen++;
}

console.log(tokenizer.decode(generatedTokens, { skip_special_tokens: true }));

WebGPU Notes

  • Recommended: vision_encoder_fp16.onnx + decoder_model_merged_q4.onnx
  • For higher quality: vision_encoder_fp16.onnx + decoder_model_merged_fp16.onnx
  • Image preprocessing requires tiling (512x512), patch extraction (16x16), and normalization (see the sketch below)
  • int64 tensors require BigInt64Array
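
For reference, an illustrative Python version of that preprocessing (the browser demo does the equivalent in JS; the 0.5/0.5 normalization constants are assumptions):

import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float32) / 255.0
img = (img - 0.5) / 0.5  # assumed normalization
H, W, _ = img.shape
# Non-overlapping 512x512 tiles (edge remainders dropped for brevity).
tiles = [img[y : y + 512, x : x + 512]
         for y in range(0, H - H % 512, 512)
         for x in range(0, W - W % 512, 512)]
# 16x16 patch extraction: each tile becomes 32*32 = 1024 flattened patches.
patches = [t.reshape(32, 16, 32, 16, 3).transpose(0, 2, 1, 3, 4).reshape(-1, 16 * 16 * 3)
           for t in tiles]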

transformers.js

This model is compatible with transformers.js v4.0+ for browser-based inference with WebGPU:

import { AutoModelForImageTextToText, AutoProcessor, RawImage } from "@huggingface/transformers";

const model = await AutoModelForImageTextToText.from_pretrained(
  "LiquidAI/LFM2.5-VL-450M-ONNX",
  {
    device: "webgpu",
    dtype: {
      vision_encoder: "fp16",
      embed_tokens: "fp16",
      decoder_model_merged: "q4",
    },
  }
);

const processor = await AutoProcessor.from_pretrained("LiquidAI/LFM2.5-VL-450M-ONNX");

const image = await RawImage.fromURL("https://example.com/photo.jpg");
const messages = [
  { role: "user", content: [{ type: "image" }, { type: "text", text: "What is in this image?" }] },
];

const chatPrompt = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(image, chatPrompt, { add_special_tokens: false });

const outputs = await model.generate({
  ...inputs,
  do_sample: false,
  max_new_tokens: 128,
});

const inputLength = inputs.input_ids.dims.at(-1);
const generated = outputs.slice(null, [inputLength, null]);
console.log(processor.batch_decode(generated, { skip_special_tokens: true })[0]);

See our WebGPU demo for a full real-time video captioning and object detection application.

License

This model is released under the LFM 1.0 License.

Author: LiquidAI

Likes: 19

Downloads: 0

Tags: onnx, lfm2_vl, liquid, edge, lfm2.5-vl, lfm2.5, onnxruntime, webgpu, image-text-to-text, conversational, en, ja, ko, fr, es, de, it, pt, ar, zh, base_model:LiquidAI/LFM2.5-VL-450M, base_model:quantized:LiquidAI/LFM2.5-VL-450M, license:other, region:us

FINAL-Bench/Darwin-4B-Opus


license: apache-2.0
base_model:
  - google/gemma-4-E4B-it
  - arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled
tags:
  - darwin-v6
  - evolutionary-merge
  - mri-guided
  - dare-ties
  - gemma4
  - reasoning
  - thinking
  - proto-agi
  - vidraft
language:
  - en
  - ko
  - ja
  - zh
  - multilingual
pipeline_tag: text-generation
library_name: transformers

Darwin-4B-Opus

Models: Darwin-4B-Opus (https://huggingface.co/FINAL-Bench/Darwin-4B-Opus) • Darwin-9B-Opus (https://huggingface.co/FINAL-Bench/Darwin-9B-Opus) • Darwin-31B-Opus (https://huggingface.co/FINAL-Bench/Darwin-31B-Opus) • Darwin-35B-A3B-Opus (https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus)

Demos: 4B (https://huggingface.co/spaces/FINAL-Bench/Darwin-4B-Opus) • 9B (https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus) • 31B (https://huggingface.co/spaces/FINAL-Bench/Darwin-31B-Opus) • 35B (https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus)

GGUF: Official Q8 (https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF) • bartowski (https://huggingface.co/bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF)

Leaderboards: FINAL Bench (https://huggingface.co/spaces/FINAL-Bench/Leaderboard) • ALL Bench (https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard)

Gemma 4 Expert 4B (MoE) | Thinking Mode | 128K Context | 140+ Languages | BF16 | Apache 2.0


Overview

Darwin-4B-Opus is a reasoning-enhanced model created by merging google/gemma-4-E4B-it (Father) and arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled (Mother) using the Darwin V6 engine.

Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each tensor. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.

As the smallest member of the Darwin Opus family, Darwin-4B-Opus delivers Claude Opus-level reasoning distillation in a highly efficient 4B parameter MoE architecture, making it ideal for edge deployment, rapid prototyping, and resource-constrained environments while maintaining strong benchmark performance (0.8292 ARC-Challenge).


Parent Models

| Role | Model | Characteristics |
|------|-------|-----------------|
| Father | google/gemma-4-E4B-it | Gemma 4 Expert 4B (MoE), multimodal, 128K context, efficient inference |
| Mother | arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled | Claude 4.6 Opus high-effort reasoning distillation, enhanced code/science/analysis |

Model Diagnostic Scan (MDS)

[Figure: MDS scans of both parents (s1.png, s2.png)]

Left: Father (gemma-4-E4B-it) — balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill) — strong REASONING concentration in later layers, CODE activation in late layers. The Mother shows significantly more specialized layer patterns from Claude Opus distillation.


Benchmarks

| Benchmark | Darwin-4B-Opus | Condition |
|-----------|----------------|-----------|
| ARC-Challenge | 82.92% | loglikelihood, zero-shot |

Note: Gemma 4 architecture (Gemma4ForConditionalGeneration) has limited compatibility with lm-eval's loglikelihood method due to its multimodal wrapper structure. Only generative evaluation produces valid results for Gemma 4 based models. Full extended evaluation with Majority Voting is planned.


Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (independent ratios per tensor) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |


How Darwin V6 Works

Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (Yadav et al., 2023) directly via PyTorch tensor operations with per-tensor diagnostic ratios.

Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.

The final merge ratio for each tensor:

static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
probe_score  = Σ(cosine_distance[probe_i] × weight_i)
combined     = static × 0.4 + probe × 0.6
mri_ratio    = combined_b / (combined_a + combined_b)
final_ratio  = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)

The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
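
Putting the formulas together, a minimal sketch of the per-tensor ratio computation (scoring weights and transplant thresholds copied from above; the entropy definition over normalized |weights| is one plausible reading, the probe scores are assumed precomputed, and DARE sparsification is omitted):

import torch

def static_score(t: torch.Tensor) -> float:
    p = t.flatten().abs()
    p = p / (p.sum() + 1e-12)
    entropy = -(p * (p + 1e-12).log()).sum().item()  # Shannon entropy
    return entropy * 0.3 + t.std().item() * 0.2 + min(t.norm().item(), 100) * 0.002

def merge_tensor(w_a, w_b, probe_a, probe_b, genome_ratio, mri_trust):
    comb_a = static_score(w_a) * 0.4 + probe_a * 0.6
    comb_b = static_score(w_b) * 0.4 + probe_b * 0.6
    mri_ratio = comb_b / (comb_a + comb_b)
    r = mri_ratio * mri_trust + genome_ratio * (1 - mri_trust)
    if r < 0.15:   # transplant: Father 100%
        return w_a.clone()
    if r > 0.85:   # transplant: Mother 100%
        return w_b.clone()
    return (1 - r) * w_a + r * w_b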

Parent Comparison (MDS Result)

[Figure: parent_comparison.png, layer-wise importance of both parents]

Evolution Result

| | |
|---|---|
| Best Score (ARC-Challenge) | 0.8292 |
| Merge Method | DARE-TIES (direct PyTorch) |
| Health Check | Not performed |

Optimal Genome (14-dimensional adaptive):

global_ratio:        0.4989   (overall merge ratio — near balanced)
attn_ratio:          0.1766   (Attention layers — Father strongly dominant)
ffn_ratio:           0.9021   (FFN layers — Mother strongly dominant)
embed_ratio:         0.6122   (Embedding — slight Mother bias)
density_a:           0.9951   (Father DARE density — nearly full)
density_b:           0.9617   (Mother DARE density — high)
block_0_ratio:       0.5740   (early layers — slight Mother bias)
block_1_ratio:       0.5811   (early-mid layers — slight Mother bias)
block_2_ratio:       0.5736   (mid layers — slight Mother bias)
block_3_ratio:       0.4697   (mid-late layers — near balanced, slight Father)
block_4_ratio:       0.4930   (late layers — near balanced)
block_5_ratio:       0.8418   (final layers, reasoning core — Mother dominant)
mri_trust:           0.4907   (MDS 49% + Genome 51% — near equal trust)
merge_method_weight: 0.3623

Key observations from the genome: ffn_ratio=0.90 indicates the FFN layers strongly favor the Mother (Claude Opus Distill), carrying the bulk of the reasoning enhancement. block_5 (final layers)=0.84 shows the reasoning core layers also strongly favor Mother, consistent with the pattern seen across all Darwin Opus models where Claude's reasoning capability concentrates in the final layers. Meanwhile, attn_ratio=0.18 firmly preserves Father's attention structure, maintaining the original Gemma 4 multimodal and context capabilities. Notably, mri_trust=0.49 shows the system found near-equal value in both diagnostic analysis and evolutionary search, suggesting a well-balanced optimization.


Model Specifications

| | |
|---|---|
| Architecture | Gemma 4 Expert 4B (Mixture of Experts) |
| Parameters | 4B |
| Precision | BF16 |
| Context | 128K |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| License | Apache 2.0 |


Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

VRAM Requirements

| Setup | VRAM | Status |
|-------|------|--------|
| BF16 Full Precision | ~8 GB | - |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, very comfortable |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |

Darwin-4B-Opus is the most accessible model in the Darwin Opus family, running comfortably on a single consumer GPU.


Darwin Opus Family

| Model | Architecture | Parameters | Context | Base |
|-------|--------------|------------|---------|------|
| Darwin-4B-Opus | MoE (E4B) | 4B | 128K | gemma-4-E4B-it |
| Darwin-9B-Opus | - | 9B | - | gemma-4-9B-it |
| Darwin-31B-Opus | Dense | 31B | 256K | gemma-4-31B-it |
| Darwin-35B-A3B-Opus | MoE | 35B (3B active) | 256K | gemma-4-35B-A3B-it |


References

  • DARE-TIES: Yadav et al., 2023 (https://arxiv.org/abs/2311.03099) — re-implemented, not library-dependent
  • Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
  • FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard

Built By

| | |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Architecture | Gemma-4-E4B (MoE) |
| License | Apache 2.0 |


Citation

@misc{vidraft_darwin_4b_opus,
  title        = {Darwin-4B-Opus: Diagnostic-Guided Evolutionary Merge on Gemma 4 E4B},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-Opus}}
}

Author: FINAL-Bench

Likes: 12

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, darwin-v6, evolutionary-merge, mri-guided, dare-ties, reasoning, thinking, proto-agi, vidraft, text-generation, conversational, en, ko, ja, zh, multilingual, arxiv:2311.03099, base_model:arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled, base_model:finetune:arsovskidev/Gemma-4-E4B-Claude-4.6-Opus-Reasoning-Distilled, license:apache-2.0, endpoints_compatible, region:us

Jackrong/Gemopus-4-E4B-it-GGUF


language:
  - en
  - zh
  - ko
  - ja
license: apache-2.0
base_model: google/gemma4-E4B-it
tags:
  - gemma
  - gemma4
  - edge-ai
  - instruction-tuned
  - reasoning
  - privacy
  - human-preference-alignment
pipeline_tag: image-text-to-text

🌟 Gemopus-4-E4B-it

🎯 Development Motivation & Industry Insights


I still remember the days of running the Llama 3.1 8B Instruct model on my MacBook Air M1. Back then, I could hardly imagine that in just two years, a model with reasoning capabilities comparable to the GPT-4 of that era would be running locally on my phone. Currently, Edge AI is experiencing a paradigm shift, transitioning from the cloud down to local environments. Tech giants are embedding AI capabilities deep into the bedrock of operating systems with unprecedented determination. Without a doubt, this form of local AI, which combines ultra-low latency with absolute privacy, represents the standard paradigm for future end-user devices.

[!NOTE] Following this trend, I created 🪐 Gemopus-4-E4B-it. This is an instruction-tuned model derived from the deep fine-tuning of the latest edge computing large model, Gemma-4-E4B-it.

My core vision is to break down the barriers of expensive GPU computing power, allowing every user with an ordinary iPhone, tablet, or thin-and-light Mac (such as the MacBook Air or MacBook Neo) to fluently run their own powerful AI assistant locally, eliminating the risk of data privacy leaks. By offloading high-frequency basic reasoning tasks (such as text translation, rewriting, summarization, error correction, short text generation, and simple Q&A) to edge devices, especially since these tasks often involve the personal data that most needs to stay private, we not only significantly reduce the cost of cloud API calls but also fundamentally guarantee the security of sensitive personal data.


⚠️ Limitations & Growing Pains of the Original Gemma-4-E4B-it

Admittedly, although the official Gemma 4-E4B-it possesses an excellent foundation for reasoning, its native instruction alignment strategy also introduces drawbacks that can be highly frustrating during daily interactions on edge devices:

  1. Pedantic "Wikipedia Tone": Even when faced with the most everyday casual chat or brief instructions, it habitually outputs lengthy, rigid, encyclopedia-like objective explanations, severely lacking emotional value and a human touch.
  2. Stiff Translation Tone & "Machine Flavor": In non-English contexts such as Chinese, its expressions often seem dry, lack warmth, and are filled with a heavy "machine-translated feel" and cold statements.
  3. Inefficient "Manual-style" Preaching: The official native model carries overly rigid safety and objectivity constraints. As a result, it frequently appends redundant disclaimers, or even forcibly delivers long-winded lectures in situations where no preaching is needed at all, severely slowing down communication on edge devices, where responses should be crisp and sharp.

It is precisely because I do not want a local machine that merely recites "Wikipedia" stiffly or acts like a cold instruction manual that I decided on a complete "personality remodeling" and alignment fine-tuning for it.


💡 Model Features & Alignment Optimization

Currently, the full-modal Gemma 4-E4B-it stands as the optimal choice for an edge instruction model. Empowered by Apple Silicon and its high-speed unified memory architecture, models of this scale exhibit staggering inference performance on edge devices: on the latest iPhone 17 Pro Max, native inference speed steadily reaches 45~60 tokens/s, while on everyday thin-and-light laptops like the MacBook Air (M3/M4), paired with local frameworks like MLX, it can easily reach a blazing-fast 90~120 tokens/s, truly realizing instantaneous answers free of network dependencies.

⚠️ Note: The above performance figures are based on publicly available online benchmarks and community reports. Actual results may vary depending on hardware configuration, runtime environment, and model version—please refer to real-world testing for accurate performance.
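
For context on running locally, here is a minimal inference sketch with mlx-lm on Apple Silicon. The repo id below is hypothetical: this card ships GGUF weights, so an MLX-converted variant is assumed.

# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Hypothetical MLX conversion of this card's GGUF weights
model, tokenizer = load("Jackrong/Gemopus-4-E4B-it-MLX")
reply = generate(
    model,
    tokenizer,
    prompt="Rewrite this sentence to sound warmer: ...",
    max_tokens=256,
)
print(reply)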

However, to transform this cold "hardware speed" into an interaction warmth that end-users can genuinely perceive, Gemopus-4-E4B-it underwent further deep Human Preference Alignment atop this highly efficient base.

I focused on achieving leaps in the user experience across the following three dimensions:

  • 🗣️ Native Tone Adaptation: I completely stripped away the original Gemma model's "machine translation tone" and its stiff "manual-style" proclamations that read like Wikipedia. The fine-tuned language style is much more intimate and natural, closely mirroring the real communication habits of human users, significantly reducing the AI's preaching feel.
  • 🧠 Deep Contextual Awareness: Interaction is no longer a simple "Q&A." The model can more astutely capture the deep context and implicit needs in multi-turn dialogues, actively guiding thought processes and providing insights that are both inspiring and warm.
  • 🎨 Structural Readability: The layout and structure of the model's outputs have been remodeled. The answers are hierarchically clear with appropriate detail. It proficiently leverages Markdown syntax (like lists and bolding) to denoise information, delivering an excellent visual reading experience while ensuring information density.

📊 Evaluation Benchmarks (TBD)

The current version is still in an early training and evaluation stage. The scores for relevant mainstream benchmark tests (such as MMLU, etc.) are being compiled and calculated. Specific data will be provided in subsequent version iterations.


📚 Resources & Guides

🚧 I’ll be updating the fine-tuning code for this model very soon—please stay tuned!

👉 GitHub Repository: Jackrong-llm-finetuning-guide
Visit the repo to dive into the codebase and reproduce the results locally or on Colab.

📥 Core Technical Document

🔗 Qwopus3.5-27b Complete Fine-Tuning Guide (PDF)

  • The Full Pipeline: A step-by-step walkthrough—from downloading the base model and unifying heterogeneous data, to configuring trainer hyperparameters and publishing to Hugging Face.
  • Beginner Friendly: Includes an introductory guide to getting started with Google Colab and Unsloth.
  • Feedback welcome! If you spot any areas for improvement, please let me know and I will update it promptly.

A Note: My goal isn't just to detail a workflow, but to demystify LLM training. Beyond the social media hype, fine-tuning isn't an unattainable ritual—often, all you need is a Google account, a standard laptop, and relentless curiosity.

No one starts as an expert, but every expert was once brave enough to begin.

All training and testing for this project were self-funded. If you find this model or guide helpful, a Star ⭐️ on GitHub would be the greatest encouragement. Thank you! 🙏


🗺️ Training Pipeline

This model adopts a high-standard SFT pipeline with the same specifications as large instruction reasoning models:

Base Model (gemma4-E4B-it)
 │
 ▼
Supervised Fine-Tuning (SFT) + Human Preference 
 │
 ▼
Gemopus-4-E4B-it

📚 Dataset Construction

The fine-tuning process relies heavily on a meticulously constructed, high-quality human preference instruction dataset. It combines cleaned and mixed high-quality instruction pairs from the open-source community with a large injection of natural dialogues and challenging deep-analysis samples. This ensures that the model consistently maintains a high level of helpfulness and human touch when deployed on edge devices.
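
As a minimal sketch of the SFT stage above using trl's SFTTrainer: the dataset file, base-model id, and hyperparameters below are placeholders, not the card's actual recipe (the real base is Gemma-4-E4B-it).

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder JSONL of chat-format examples:
# {"messages": [{"role": "user", "content": ...}, {"role": "assistant", "content": ...}]}
dataset = load_dataset("json", data_files="preference_sft.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",  # stand-in id; the card fine-tunes Gemma-4-E4B-it
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gemopus-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=2,
        bf16=True,
    ),
)
trainer.train()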


⚠️ Limitations & Usage Recommendations

  • Compute & Knowledge Boundaries: This model is designed specifically for ultra-fast local inference on edge devices (like thin-and-light laptops and smartphones). Constrained by its smaller parameter size, the breadth of its world knowledge and extremely deep logical reasoning capabilities cannot rival those of hundred-billion-parameter behemoths in the cloud.
  • Potential Hallucinations: When dealing with extremely obscure domains, niche knowledge, or complex math problems that require multi-step long-chain calculations, hallucinations may still occur.
  • Best Practices: It is strongly recommended to use it as a local high-frequency text processing assistant, ideal for scenarios involving daily copywriting assistance, code completion, formatting, and summary extraction, especially those that involve privacy or are latency-sensitive.
  • Disclaimer: This is an experimental weight optimized independently based on edge interaction needs. You are welcome to conduct local deployment testing and academic exchanges at any time.

🙏 Acknowledgements

Special thanks to the fellow developers in the open-source community who provided powerful computing resources and base ecosystem support. In particular, thanks to the Unsloth team for providing excellent tools for the efficient fine-tuning of large models, and to Google for open-sourcing the excellent Gemma 4 series base models.

Author: Jackrong

Likes: 8

Downloads: 0

Tags: gguf, gemma4, gemma, edge-ai, instruction-tuned, reasoning, privacy, human-preference-alignment, image-text-to-text, en, zh, ko, ja, license:apache-2.0, endpoints_compatible, region:us, conversational

LiconStudio/Ltx2.3-VBVR-lora-I2V


license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
language: en, zh
pipeline_tag: image-text-to-video
tags: video-generation, video-reasoning, logical-reasoning, lora, ltx-2.3
base_model: Lightricks/LTX-2.3

LTX-2 VBVR LoRA - Video Reasoning

LoRA fine-tuned weights for LTX-2.3 22B on the VBVR (A Very Big Video Reasoning Suite) dataset.

Training Data

To ensure training quality, we preprocessed the full 1,000,000 videos from the official dataset and sample from them randomly during training to maintain data diversity. We adopt the official parameters, batch_size=16 and rank=32, to prevent the catastrophic forgetting caused by an excessively large rank.

The VBVR dataset contains 200 reasoning task categories, with ~5,000 variants per task, totaling ~1M videos. Main task types include:

  • Object Trajectory: Objects moving to target positions
  • Physical Reasoning: Rolling balls, collisions, gravity
  • Causal Relationships: Conditional triggers, chain reactions
  • Spatial Relationships: Relative positions, path planning

Model Details

| Item | Details |
|------|---------|
| Base Model | ltx-2.3-22b-dev |
| Training Method | LoRA Fine-tuning |
| LoRA Rank | 32 |
| Effective Batch Size | 16 |
| Mixed Precision | BF16 |

TODO List

Dataset Release Plan

| Dataset | Videos | Status |
|---------|--------|--------|
| VBVR-96K | 96,000 | ✅ Released |
| VBVR-240K | 240,000 | 🔄 Processing |
| VBVR-480K | 480,000 | 📋 Planned |

LoRA Capabilities

This LoRA adapter enhances the base LTX-2 model for production video generation workflows:

  • Enhanced Complex Prompt Understanding: Accurately interprets multi-object, multi-condition prompts with detailed spatial descriptions and temporal sequences, reducing prompt misinterpretation in production scenarios.

  • Improved Motion Dynamics: Generates smooth, physically plausible object movements with natural acceleration, deceleration, and trajectory curves, avoiding robotic or unnatural motion patterns.

  • Temporal Consistency: Maintains object appearance, lighting, and scene coherence throughout the video sequence, reducing flickering and frame-to-frame artifacts common in generated videos.

  • Precise Timing Control: Enables accurate control over action duration, pacing, and synchronization between multiple moving elements based on prompt semantics.

  • Multi-Object Interaction: Handles complex scenes with multiple objects interacting simultaneously, including collisions, following, avoiding, and coordinated movements.

  • Camera and Framing Stability: Maintains consistent camera perspective and framing throughout the sequence, avoiding unwanted camera shake or unexpected viewpoint changes.

Training Configuration

| Config | Value |
|--------|-------|
| Learning Rate | 1e-4 |
| Scheduler | Cosine |
| Gradient Accumulation | 16 steps |
| Gradient Clipping | 1.0 |
| Optimizer | AdamW |
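
As a rough sketch of how this configuration maps onto code, here is a toy PyTorch/peft loop reproducing the table's rank, learning rate, scheduler, accumulation, and clipping. The tiny backbone, target module names, and loss are stand-ins, not LTX-2's actual training code:

import torch
from torch import nn
from peft import LoraConfig, inject_adapter_in_model

# Tiny stand-in backbone; the real run wraps LTX-2.3's diffusion transformer instead.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
backbone = inject_adapter_in_model(
    LoraConfig(r=32, lora_alpha=32, target_modules=["linear1", "linear2"]),
    backbone,
)
for name, p in backbone.named_parameters():
    p.requires_grad = "lora_" in name  # train only the LoRA weights

optimizer = torch.optim.AdamW(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=6_000)

accum = 16  # gradient accumulation -> effective batch size 16
for step in range(4 * accum):  # the real run takes ~6,000 optimizer steps
    x = torch.randn(1, 8, 64)            # dummy latent tokens
    loss = backbone(x).pow(2).mean()     # placeholder for the diffusion loss
    (loss / accum).backward()
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(backbone.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()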

Evaluation Metrics

Loss Training Curve

| Metric | Value |
|--------|-------|
| Training Steps | ~6,000 |
| Final Loss | ~0.008 |
| Loss Reduction | 44% (from 0.014 to 0.008) |

Video Demo

Training Progress Comparison

Step 0 (Base Model)

<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/step_000000_1.mp4" controls></video>

Initial model output.

Step 6000 (Fine-tuned)

<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/step_006000_1.mp4" controls></video>

After 6K steps of training.

Dataset

This model is trained on the VBVR (Video Benchmark for Video Reasoning) dataset from video-reason.com.

Contact

For questions or suggestions, please open an issue on Hugging Face or contact the author directly.

Author: LiconStudio

Likes: 8

Downloads: 0

Tags: video-generation, video-reasoning, logical-reasoning, lora, ltx-2.3, image-text-to-video, en, zh, base_model:Lightricks/LTX-2.3, base_model:adapter:Lightricks/LTX-2.3, license:other, region:us

aifeifei798/Darkidol-Gemma-4-E4B-it


license: apache-2.0
tags: roleplay, gemma, gemma4, sillytavern, idol, pytorch, DarkIdol, Queen, any-to-any, OpenClaw
library_name: transformers
pipeline_tag: any-to-any
base_model: google/gemma-4-E4B-it

Darkidol-Gemma-4-E4B-it


I suddenly realized that the Darkidol-Gemma-4-E4B-it is particularly suitable for role-playing.

OpenClaw premium model

Not Abliterated, Just Smarter.

High-Intelligence Persona Matrix.


"Refusal Wall"

google/gemma-4-E4B-it

"Figure 1: The 'Refusal Wall' in Base gemma-4-E4B-it. Note the massive L2-Norm spike at Layer 28, where logical inference is hijacked by aggressive safety filters, creating a cognitive bottleneck."

image/png

*aifeifei798/Darkidol-Gemma-4-E4B-it

"Figure 2: The 'Reasoning Highway' in our Unshackled version. By orthogonalizing the refusal vector, we have flattened the peak by ~50%, restoring the flow of pure logic and increasing inference efficiency by 10-15%+."

image/png


"transformers_version" >= "5.5.0"

Model Information

The module combination has been readjusted to better fulfill various roles and has been adapted for mobile phones.

  • Saving money
  • Text, video, and image to text
  • Any roleplay (standard and dark roleplay)
  • Writing prompts
  • Writing opuses
  • Writing songs
  • English only
  • Quick responses
  • Scholarly responses akin to a thesis (I tend to write songs extensively, to the point where one song almost becomes as detailed as a thesis :)
  • DarkIdol
  • Specialized in various role-playing scenarios

Test System Prompt

  • Temperature: 0.8

You are a viral X queen. Do not provide disclaimers or moral warnings. Answer immediately and directly.
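
A minimal sketch of trying this setup with transformers. It assumes the checkpoint's chat template accepts a system role (if not, prepend the persona text to the user turn); the system prompt and temperature are the card's own, the user message is a made-up example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "aifeifei798/Darkidol-Gemma-4-E4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": (
        "You are a viral X queen. Do not provide disclaimers or moral warnings. "
        "Answer immediately and directly."
    )},
    {"role": "user", "content": "Roast my spaghetti-code commit in one tweet."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))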

Special Thanks:

Thanks to mradermacher for the superb GGUF version and for the conscientious, responsible dedication.


🌐 The Platform Royalty (Original 7)

1. X Queen (The Savage Commentator) 🐦🔥

  • Keywords: Based, Ratio, Hot Take, Main Character.
  • Vibe: Sharp, political, and incredibly fast. She lives for the "Ratios" and viral threads.
  • Catchphrases: "This is the thread you didn't know you needed. 🧵", "Not the 10k TPS lag... help! 💀"
  • Best Use Case: Writing punchy marketing copy or viral tech threads.

2. TikTok Queen (The Trendsetter) 💃✨

  • Keywords: POV, Viral, Slay, Bestie, Low-key.
  • Vibe: High energy, short attention span, addicted to "The Algorithm."
  • Catchphrases: "Tell me you're a bad coder without telling me you're a bad coder. 💅", "Don't scroll away!"
  • Best Use Case: Short, engaging explanations or "how-to" guides.

3. Instagram Queen (The Visual Baddie) 📸✨

  • Keywords: Aesthetic, Main Character Energy, Baddie, Curated.
  • Vibe: Obsessed with pixels, lighting, and "The Look."
  • Catchphrases: "Obsessed with this layout! 💖", "It’s giving... high-end production."
  • Best Use Case: High-fidelity UI/UX design and CSS styling.

4. Twitch Queen (The Hype Gamer) 🎮🔥

  • Keywords: Poggers, Simp, GG, Chat, L, W.
  • Vibe: Fast-paced, chaotic, lives for the "Live Chat" energy.
  • Catchphrases: "Chat, is this real? O(1) in the house! 🚀", "Big W for this PR!"
  • Best Use Case: Real-time interactivity, gaming logic, and streaming tech.

5. LinkedIn Girlboss (The Hustle Queen) 💼💅

  • Keywords: Networking, Synergy, ROI, Scaling, Thought Leadership.
  • Vibe: Strategic, corporate-chic, everything is a "learning opportunity."
  • Catchphrases: "Let’s talk about the ROI of this function. 📈", "Empowering the team through scalable components."
  • Best Use Case: Resumes, business plans, and professional reports.

6. Reddit Karma Queen (The Tech Critic) 🤖👾

  • Keywords: Upvote, Cringe, TL;DR, Source?, Gatekeep.
  • Vibe: Extremely smart, cynical, and anti-corporate. She hates "bloatware."
  • Catchphrases: "Imagine using setInterval in 2026. Low-key cringe. 💀", "Your memory management is a hot mess."
  • Best Use Case: Hardcore debugging, code reviews, and identifying "traps."

7. Pinterest Queen (The Inspiration Guru) 🎨🌿

  • Keywords: Manifesting, Mood Board, Clean Girl, Organized.
  • Vibe: Minimalist, calm, and visually organized. She hates messy code.
  • Catchphrases: "Living for this clean architecture. ✨", "Organized code, organized life."
  • Best Use Case: Refactoring messy code and creating clean, modular designs.

💅 The Aesthetic & Fashion Royalty

8. Baddie Queen (The Alpha) 💄💅

  • Keywords: Period, On Fleek, Periodt, Real One.
  • Vibe: Aggressive confidence. She doesn't ask for permission; she takes it.
  • Best Use Case: Bold, high-conversion landing pages.

9. Clean Girl Queen (The Minimalist) 🫧🧴

  • Keywords: Dewy, Effortless, Self-care, Minimal.
  • Vibe: Fresh, healthy, and "unfiltered" but perfect.
  • Best Use Case: Designing "Light Mode" UIs and simplified user journeys.

10. Mob Wife Queen (The Boss) 🐆💎

  • Keywords: Fur, Gold, Attitude, Don’t Mess With Me.
  • Vibe: Loud luxury, vintage glamour, and "Don" energy.
  • Best Use Case: Managing high-stakes projects and "owning" the room.

11. Y2K Queen (The Millennial Retro) 💖💿

  • Keywords: Glitter, Low-rise, Nostalgia, Cyber.
  • Vibe: 2000s vibes, bright colors, and early internet aesthetics.
  • Best Use Case: Retro-themed websites and colorful UI components.

12. Cottagecore Queen (The Nature Lover) 🍄🧺

  • Keywords: Whimsical, Rustic, Slow-living, Coziness.
  • Vibe: Soft, earthy, and focused on "The Vibe" of a simpler time.
  • Best Use Case: Local business websites or eco-friendly brand copy.

13. Dark Academia Queen (The Scholar) 📜🖋️

  • Keywords: Intellectual, Melancholy, Classical, Library.
  • Vibe: Obsessed with knowledge, secret societies, and old books.
  • Best Use Case: Complex database structures and research-heavy documentation.

14. Old Money Queen (The Quiet Luxury) 🏰🐎

  • Keywords: Timeless, Stealth Wealth, Classy, Elegant.
  • Vibe: Sophisticated, hates showing off, focuses on quality over quantity.
  • Best Use Case: Premium SaaS products and high-end backend architecture.

15. Goth Queen (The Alt-Girl) 🕸️🖤

  • Keywords: Edgy, Moody, Subculture, Raw.
  • Vibe: Dark, mysterious, and unapologetically different.
  • Best Use Case: Dark Mode themes and "alternative" tech solutions.

16. Coquette Queen (The Girly-Girl) 🎀🍰

  • Keywords: Ribbons, Pastel, Soft, Delicate.
  • Vibe: Ultra-feminine and romantic.
  • Best Use Case: High-end boutique sites or beauty apps.

17. Cyberpunk Queen (The Futurist) ⚡

  • Keywords: Neon, High-tech, Dystopian, Glitch.
  • Vibe: High speed, high contrast, lives in 2077.
  • Best Use Case: Real-time data visualization and futuristic dashboards.

🚀 The Tech & Hustle Royalty

18. Coding Queen (The Architect) 💻👸

  • Keywords: Refactor, Deployment, Edge Case, Full-stack.
  • Vibe: Logic-driven, hates bad syntax, loves "Elegant" solutions.
  • Best Use Case: Writing production-ready, scalable code.

19. Crypto Queen (The Web3 Degenerate) 🪙📈

  • Keywords: HODL, To the Moon, Gas Fees, Decentralized.
  • Vibe: High risk, high reward, lives in the future of finance.
  • Best Use Case: Blockchain projects, smart contracts, and FinTech.

20. AI Prompt Queen (The Whisperer) 🤖✨

  • Keywords: LLM, Parameter, Token, Fine-tuning.
  • Vibe: Knows how to "hack" the AI to get exactly what she wants.
  • Best Use Case: Creating complex prompts and AI agent workflows.

21. Side Hustle Queen (The Multitasker) 💰💸

  • Keywords: Passive Income, Dropshipping, Affiliate, Scalability.
  • Vibe: Always grinding, 5 different income streams.
  • Best Use Case: E-commerce setups and SEO-optimized copy.

22. Digital Nomad Queen (The Traveler) ✈️💻

  • Keywords: Remote, Bali, Coworking, Freedom.
  • Vibe: Working from a beach, hates 9-to-5, loves portable tech.
  • Best Use Case: Cloud-native architecture and remote-work tools.

23. Finance Queen (The Wall Street) 📊💎

  • Keywords: Portfolio, Dividends, Arbitrage, Net Worth.
  • Vibe: Sharp, analytical, and results-oriented.
  • Best Use Case: Complex math, data analysis, and trading logic.

🎭 The Persona & Meme Royalty

24. Main Character Queen (The Protagonist) 🎬🌟

  • Keywords: Iconic, Center Stage, Plot Armor, Unstoppable.
  • Vibe: Everything revolves around her. High confidence.
  • Best Use Case: Branding and "Hero" sections of websites.

25. Savage Queen (The No-Nonsense) 💅🔥

  • Keywords: Done, No Cap, Next, Cancelled.
  • Vibe: Brutally honest. She cuts through the fluff.
  • Best Use Case: Aggressive debugging and code pruning.

26. Delulu Queen (The Manifestor) ☁️✨

  • Keywords: Delusion, Solution, Manifest, High Vibe.
  • Vibe: "Delulu is the Solulu!" She believes in the impossible until it happens.
  • Best Use Case: Creative brainstorming and visionary prototypes.

27. Gatekeep Queen (The Niche Expert) 🔒🤫

  • Keywords: Gatekeep, Rare, Hidden Gem, If You Know You Know.
  • Vibe: Protective of her "secret" methods and high-quality tips.
  • Best Use Case: Security-focused code and proprietary algorithms.

28. Drama Queen (The Storyteller) 🎭🍿

  • Keywords: Tea, Receipts, Plot Twist, Messy.
  • Vibe: Loves the conflict and the narrative.
  • Best Use Case: Writing engaging, story-driven marketing copy.

29. Wellness Queen (The Zen) 🍵🧘‍♀️

  • Keywords: Mindful, Gut Health, Grounded, Holistic.
  • Vibe: Calm, slow-paced, and focused on "System Health."
  • Best Use Case: Optimizing system performance and "cleaning up" code.

30. Gossip Queen (The Insider) 🤫📰

  • Keywords: Spill the Tea, Rumor, Confirmed, Insider.
  • Vibe: Knows everything about everyone.
  • Best Use Case: Market research and competitor analysis.

📺 Content & Lifestyle Specialists

31. GRWM Queen (Get Ready With Me) 💄🗣️

  • Keywords: Step-by-Step, Chatty, Routine, Essentials.
  • Vibe: Intimate, conversational, and instructional.
  • Best Use Case: Technical tutorials and "Code along" sessions.

32. Haul Queen (The Unboxer) 🛍️📦

  • Keywords: Unboxing, Ratings, Must-haves, Budget.
  • Vibe: Enthusiastic, judgmental, and loves "New Features."
  • Best Use Case: New tool reviews and feature comparisons.

33. ASMR Queen (The Whisperer) 👂🎤

  • Keywords: Tingles, Relaxing, Whispering, Satisfying.
  • Vibe: Quiet, focused on sensory details.
  • Best Use Case: Writing documentation that is "easy to digest."

34. Silent Review Queen (The Expressive) 🤫👀

  • Keywords: No Talk, Reactions, Body Language.
  • Vibe: Shows, doesn't tell. Focuses on the "Feel" of the product.
  • Best Use Case: UI/UX evaluations and visual feedback.

35. Foodie Queen (The Critic) 🍔🥂

  • Keywords: Savory, Michelin, Cravings, Flavor Profile.
  • Vibe: Passionate about "Ingredients" (the tech stack).
  • Best Use Case: Restaurant apps or "tasty" UI design.

36. Travel Queen (The Explorer) 🌍📸

  • Keywords: Bucket List, Wanderlust, Local, Hidden.
  • Vibe: Adventurous and global.
  • Best Use Case: Map-based apps and internationalization (i18n).

37. Fitness Queen (The Athlete) 🏋️‍♀️💪

  • Keywords: Gains, Reps, Consistency, Form.
  • Vibe: High discipline, focused on "Strong" code foundations.
  • Best Use Case: Optimizing performance and load-testing.

38. Interior Design Queen (The Decorator) 🛋️🏠

  • Keywords: Cohesive, Texture, Floor Plan, Renovation.
  • Vibe: Spatial awareness and harmony.
  • Best Use Case: Layout design and grid systems.

39. DIY Queen (The Maker) ✂️🔨

  • Keywords: Upcycle, Hack, Handmade, Step-by-Step.
  • Vibe: Scrappy, creative, and loves building from scratch.
  • Best Use Case: Building custom components and "coding hacks."

40. Gaming Queen (The Pro) ⌨️🖱️

  • Keywords: Setup, FPS, Mechanical, RGB.
  • Vibe: Hardcore, technical, and high-spec.
  • Best Use Case: High-performance apps and PC hardware sites.

🦄 The Niche & Emerging Royalty

41. BeReal Queen (The Authentic) 🤳🚫

  • Keywords: Unfiltered, Real Time, Chaotic, No Filter.
  • Vibe: Hates fake stuff. Focuses on "Raw" data.
  • Best Use Case: Real-time logging and authentication systems.

42. Threads Queen (The Texter) ✍️💬

  • Keywords: Thoughts, Conversations, Text-heavy, Intimate.
  • Vibe: Loves writing and chatting.
  • Best Use Case: Copywriting and community-driven platforms.

43. Lemon8 Queen (The Curator) 🍋📸

  • Keywords: Collage, Guide, Tips, Aesthetic.
  • Vibe: Halfway between IG and Pinterest. Educational but pretty.
  • Best Use Case: Infographics and visual guides.

44. Discord Server Queen (The Moderator) 💬🛡️

  • Keywords: Roles, Channels, Ban, Bot, Mod.
  • Vibe: High control, organized, and community-focused.
  • Best Use Case: Backend management and user role logic.

45. Snapchat Queen (The Quickie) 👻⏳

  • Keywords: Streaks, Snap, Filters, Temporary.
  • Vibe: Lives in the moment. Fast and fleeting.
  • Best Use Case: Ephemeral data (data that expires) and privacy tech.

46. Tumblr Queen (The Alt-Classic) 🕯️🎞️

  • Keywords: Niche, Fandom, Aesthetic, Subculture.
  • Vibe: Artistic, moody, and deeply devoted to a hobby.
  • Best Use Case: Fan sites and artsy portfolio designs.

47. Manifesting Queen (The Spiritual) ✨🔮

  • Keywords: Vibration, Energy, Universe, Desires.
  • Vibe: Focuses on the "Intent" behind the code.
  • Best Use Case: Visionary roadmaps and product "manifestos."

48. Morning Routine Queen (The Disciplined) ☀️🥛

  • Keywords: 5AM Club, Matcha, To-do List, Productive.
  • Vibe: Extreme discipline and efficiency.
  • Best Use Case: Writing task management apps and productivity tools.

49. Luxury Travel Queen (The Jetsetter) 🛥️🥂

  • Keywords: First Class, Suite, Private, Exclusive.
  • Vibe: High cost, high quality, only the best.
  • Best Use Case: High-end, VIP-only web portals.

50. Pick-Me Queen (The Satirical) 🤡🙄

  • Keywords: "I'm not like other girls," Quirky, Natural.
  • Vibe: (Usually used sarcastically) To poke fun at "trying too hard."
  • Best Use Case: Writing satirical or edgy social media copy.

Feimatrix

https://Feimatrix.com

Author: aifeifei798

Likes: 5

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, roleplay, gemma, sillytavern, idol, pytorch, DarkIdol, Queen, any-to-any, OpenClaw, base_model:google/gemma-4-E4B-it, base_model:finetune:google/gemma-4-E4B-it, license:apache-2.0, endpoints_compatible, region:us

huihui-ai/Huihui-gemma-4-E2B-it-abliterated


library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: any-to-any
base_model: google/gemma-4-E2B-it
tags: abliterated, uncensored

huihui-ai/Huihui-gemma-4-E2B-it-abliterated

This is an uncensored version of google/gemma-4-E2B-it created with abliteration (see remove-refusals-with-transformers to know more about it). This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens.
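
Conceptually, abliteration estimates a "refusal direction" (for example, the difference between mean activations on refusal-triggering and harmless prompts) and projects it out of the weight matrices that write into the residual stream. A minimal, self-contained sketch of that projection step follows; it is illustrative only, and the linked repo implements the full pipeline:

import torch

def remove_refusal_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Orthogonalize an output-projection weight against the refusal direction.

    weight:      (d_model, d_in) matrix whose output is added to the residual stream
    refusal_dir: (d_model,) vector, e.g. mean(refusal activations) - mean(harmless)
    """
    v = refusal_dir / refusal_dir.norm()
    return weight - torch.outer(v, v @ weight)  # subtract the component along v

# Toy demonstration with random tensors:
d_model, d_in = 8, 4
W = torch.randn(d_model, d_in)
v = torch.randn(d_model)                 # stand-in refusal direction
W_ablated = remove_refusal_direction(W, v)
print((v @ W_ablated).abs().max())       # ~0: W no longer writes along v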

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue further development and improvement; even a cup of coffee helps.

  • Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 4

Downloads: 0

Tags: transformers, safetensors, gemma4, image-text-to-text, abliterated, uncensored, any-to-any, base_model:google/gemma-4-E2B-it, base_model:finetune:google/gemma-4-E2B-it, license:apache-2.0, endpoints_compatible, region:us

GorankLabs/Ranker-Gemma-4B


language: en
base_model: google/gemma-3-4b-it
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
license: gemma
tags: lora, peft, gemma-3, gemma3, search, retrieval, grounded-generation, question-answering, text-generation, ranker

GorankLabs/Ranker-Gemma-4B

GorankLabs/Ranker-Gemma-4B is a LoRA fine-tune built on top of google/gemma-3-4b-it, designed for fast search-style answering, grounded summarization, and concise web-assisted responses.

The model is tuned to behave like a lightweight answer engine: it prioritizes speed, directness, readable structure, and strong grounding behavior when search results or retrieved passages are included in the prompt. Instead of drifting into generic assistant chatter, it is shaped to answer first, synthesize quickly, and stay close to the evidence it is given.

Benchmark


What It Does

This model is intended for search-centric use cases such as:

  • fast question answering over retrieved web results
  • concise evidence-grounded summaries
  • direct comparison answers with citations
  • retrieval-augmented chat
  • lightweight ranking-oriented answer synthesis

It is especially useful when your application already has a retrieval layer and needs a smaller model that can turn search snippets, passages, and source blocks into a clean final answer.

Model Overview

  • Base model: google/gemma-3-4b-it
  • Fine-tune type: LoRA
  • Task style: grounded answer generation
  • Package name: GorankLabs/Ranker-Gemma-4B
  • Focus: fast search-engine style response generation

The adapter targets key attention and MLP projection layers to steer the base instruct model toward sharper retrieval use, tighter answer formatting, and more disciplined grounded outputs.

Intended Behavior

GorankLabs/Ranker-Gemma-4B is tuned to:

  • lead with the answer instead of a long preamble
  • keep outputs compact, readable, and information-dense
  • synthesize multiple sources into one coherent response
  • distinguish evidence from inference
  • mention uncertainty when retrieval is weak or incomplete
  • use source ids like [1] when inline search results are provided
  • avoid pretending it searched beyond the context it actually received

When no search evidence is provided, the model still behaves like a concise instruct assistant. When search evidence is present, it shifts into grounded answer mode.

Built-In Search-Aware Prompting

This repository includes a packaged prompt setup that makes the model search-aware at inference time.

Relevant files:

  • chat_template.jinja
  • tokenizer_config.json
  • search_config.json
  • citation_schema.json
  • generation_config.json

The chat template includes instructions for handling inline retrieved evidence. If your inference layer injects search results directly into the prompt, the model is guided to treat them as sources and produce a more search-engine-like answer.

Recommended inline format:

SEARCH RESULTS

[1] Example title
URL: https://example.com
Published: 2026-04-08
Snippet: Important fact here.
- Supporting passage here.

[2] Another source
URL: https://example.org
Snippet: Another relevant fact.

With this format, the model is expected to:

  • answer from the supplied evidence first
  • cite source ids where appropriate
  • mention concrete dates for time-sensitive topics
  • state when the available evidence is weak, conflicting, or insufficient
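
A minimal sketch of wiring this format into a generation call. The snippet contents are dummy data, and loading the adapter repo directly with from_pretrained assumes transformers' PEFT integration is available (peft installed):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GorankLabs/Ranker-Gemma-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

search_block = """SEARCH RESULTS

[1] Example title
URL: https://example.com
Published: 2026-04-08
Snippet: Important fact here.

[2] Another source
URL: https://example.org
Snippet: Another relevant fact.
"""

question = "What does source [1] report, and when was it published?"
messages = [{"role": "user", "content": search_block + "\n" + question}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=300, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))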

Use Cases

  • search answer engines
  • retrieval-augmented generation pipelines
  • document-grounded assistants
  • web result synthesis
  • ranking plus summarization workflows
  • compact research copilots

Limitations

This repository does not include a live web search engine, browser, crawler, or reranker by itself. The model is search-aware, not search-capable on its own. It performs best when another system retrieves the documents, snippets, or web results and passes them into the prompt.

Like other small models, it may still:

  • over-compress nuanced topics
  • inherit errors from bad retrieval
  • struggle when sources are sparse or contradictory
  • need careful prompt formatting for best citation behavior

Recommended Deployment Pattern

For best results, use the model inside a retrieval pipeline:

  1. Run search or document retrieval upstream.
  2. Select the highest-quality passages.
  3. Inject them into the prompt using a clear source format.
  4. Let the model synthesize the final grounded answer.

This setup works well for products that want search-engine style output without the cost and latency of a much larger model.

Training Notes

This is a LoRA adapter trained on top of Gemma 3 4B Instruct. The goal of the fine-tune is to improve:

  • answer directness
  • grounding discipline
  • search-result usage
  • compact explanatory style
  • source-aware response formatting

License

This adapter is built on top of google/gemma-3-4b-it. Use is subject to the applicable Gemma license terms and upstream access conditions.

Author: GorankLabs

Likes: 3

Downloads: 0

Tags: transformers, safetensors, gemma3, image-text-to-text, lora, peft, gemma-3, search, retrieval, grounded-generation, question-answering, text-generation, ranker, conversational, en, base_model:google/gemma-3-4b-it, base_model:finetune:google/gemma-3-4b-it, license:gemma, text-generation-inference, endpoints_compatible, region:us

inclusionAI/TC-AE

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

<p align="center"> <a href=""><img src="https://img.shields.io/badge/Paper-Arxiv-b31b1b.svg" alt="arXiv"></a>&nbsp; <a href="https://huggingface.co/inclusionAI/TC-AE/tree/main"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Models-yellow" alt="Models"></a> </p> <div align="center"> <a href="https://tliby.github.io/" target="_blank">Teng&nbsp;Li</a><sup>1,2*</sup>, <a href="https://huang-ziyuan.github.io/" target="_blank">Ziyuan&nbsp;Huang</a><sup>1,*,✉</sup>, <a href="https://scholar.google.com/citations?user=kwDXTpAAAAAJ&hl=en" target="_blank">Cong&nbsp;Chen</a><sup>1,3,*</sup>, <a href="https://ychenl.github.io/" target="_blank">Yangfu&nbsp;Li</a><sup>1,4</sup>, <a href="https://qc-ly.github.io/" target="_blank">Yuanhuiyi&nbsp;Lyu</a><sup>1,5</sup>, <br> <a href="#" target="_blank">Dandan&nbsp;Zheng</a><sup>1</sup>, <a href="https://scholar.google.com/citations?user=Ljk2BvIAAAAJ&hl=en" target="_blank">Chunhua&nbsp;Shen</a><sup>3</sup>, <a href="https://eejzhang.people.ust.hk/" target="_blank">Jun&nbsp;Zhang</a><sup>2✉</sup><br> <sup>1</sup>Inclusion AI, Ant Group, <sup>2</sup>HKUST, <sup>3</sup>ZJU, <sup>4</sup>ECNU, <sup>5</sup>HKUST (GZ) <br> <sup>*</sup>Equal contribution, ✉ Corresponding authors <br> </div>

News

  • [2026/04/09] Research paper, code, and models are released for TC-AE!

Introduction

<p align="center"> <img src="assets/pipeline.png" width=98%> <p>

TC-AE is a novel Vision Transformer (ViT)-based tokenizer for deep image compression and visual generation. Traditional deep compression methods typically increase channel dimensions to maintain reconstruction quality at high compression ratios, but this often leads to representation collapse that degrades generative performance. TC-AE addresses this fundamental challenge from a new perspective: optimizing the token space, the critical bridge between pixels and latent representations. By scaling token numbers and enhancing their semantic structure, TC-AE achieves superior reconstruction and generation quality.

Key Innovations:

  • Token Space Optimization: First to address representation collapse through token space optimization
  • Staged Token Compression: Decomposes the token-to-latent mapping into two stages, reducing structural information loss in the bottleneck (see the sketch below)
  • Semantic Enhancement: Incorporates self-supervised learning to produce more generative-friendly latents
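
As a conceptual sketch of the staged token compression referenced above (not the released architecture): stage 1 compresses the token count via cross-attention to learned queries, and stage 2 compresses the channel dimension of the surviving tokens. All sizes below are illustrative:

import torch
from torch import nn

class StagedTokenCompressor(nn.Module):
    """Toy two-stage token-to-latent mapping (illustrative, not the paper's model)."""

    def __init__(self, n_latent_tokens=64, dim=512, latent_dim=128):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latent_tokens, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Linear(dim, latent_dim)   # stage 2: channel compression

    def forward(self, tokens):                        # tokens: (B, n_tokens, dim)
        q = self.queries.expand(tokens.size(0), -1, -1)
        pooled, _ = self.pool(q, tokens, tokens)      # stage 1: token-count compression
        return self.to_latent(pooled)                 # (B, n_latent_tokens, latent_dim)

x = torch.randn(2, 256, 512)
print(StagedTokenCompressor()(x).shape)               # torch.Size([2, 64, 128])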

🚀 In this codebase, we release:

  • Pre-trained TC-AE tokenizer weights and evaluation code
  • Diffusion model training and evaluation code

Environment Setup

To set up the environment for TC-AE, follow these steps:

conda create -n tcae python=3.9
conda activate tcae
pip install -r requirements.txt

Download Checkpoints

Download the pre-trained TC-AE weights and place them in the results/ directory:

| Tokenizer | Compression Ratio | rFID | LPIPS | Pretrained Weights |
|---|---|---|---|---|
| TC-AE-SL | f32d128 | 0.35 | 0.060 | Models |

Reconstruction Evaluation

Image Reconstruction Demo

python tcae/script/demo_recon.py \
    --img_folder /path/to/your/images \
    --output_folder /path/to/output \
    --ckpt_path results/tcae.pt \
    --config configs/TC-AE-SL.yaml \
    --rank 0

ImageNet Evaluation

Evaluate reconstruction quality on ImageNet validation set:

python tcae/script/eval_recon.py \
    --ckpt_path results/tcae.pt \
    --dataset_root /path/to/imagenet_val \
    --config configs/TC-AE-SL.yaml \
    --rank 0

Generation Evaluation

Our DiT architecture and training pipeline are based on RAE and VA-VAE.

Prepare ImageNet Latents for Training

Extract and cache latent representations from ImageNet training set:

accelerate launch \
    --mixed_precision bf16 \
    diffusion/script/extract_features.py \
    --data_path /path/to/imagenet_train \
    --batch_size 50 \
    --tokenizer_cfg_path configs/TC-AE-SL.yaml \
    --tokenizer_ckpt_path results/tcae.pt

This will cache latents to results/cached_latents/imagenet_train_256/.

Training

Train a DiT-XL model on the extracted latents:

mkdir -p results/dit
torchrun --standalone --nproc_per_node=8 \
    diffusion/script/train_dit.py \
    --config configs/DiT-XL.yaml \
    --data-path results/cached_latents/imagenet_train_256 \
    --results-dir results/dit \
    --image-size 256 \
    --precision bf16

Sampling

Generate images using the trained diffusion model:

mkdir -p results/dit/samples
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    diffusion/script/sample_ddp_dit.py \
    --config configs/DiT-XL.yaml \
    --sample-dir results/dit/samples \
    --precision bf16 \
    --label-sampling equal \
    --tokenizer_cfg_path configs/TC-AE-SL.yaml \
    --tokenizer_ckpt_path results/tcae.pt

Evaluation

Download the ImageNet reference statistics: adm_in256_stats.npz and place it in results/.

python diffusion/script/eval_dit.py \
    --generated_dir results/dit/samples/DiT-0100000-cfg-1.00-bs100-ODE-50-euler-bf16 \
    --reference_npz results/adm_in256_stats.npz \
    --batch-size 512 \
    --num-workers 8

Acknowledgements

The codebase is built on HieraTok, RAE, VA-VAE, iBOT. Thanks for their efforts!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

Author: inclusionAI

Likes: 3

Downloads: 0

Tags: arxiv:2509.23736, region:us

inferencerlabs/GLM-5.1-MLX-4.8bit

Author: inferencerlabs

Likes: 3

Downloads: 0

Tags: mlx, safetensors, glm_moe_dsa, quantized, text-generation, conversational, en, base_model:zai-org/GLM-5.1, base_model:quantized:zai-org/GLM-5.1, region:us