Today's AI Summary

AI Developments: Scriptwriting LLMs, Mobile Agents, and Efficiency Optimizations

Today's AI landscape features advancements in language models tailored for creative tasks, mobile automation, and efficiency improvements for on-device deployment. Research focuses on enhancing retrieval-augmented generation and improving reasoning capabilities.

Noteworthy Papers

  • Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms: This paper introduces an enhanced RAG architecture that integrates Entity Linking to improve the accuracy of educational question-answering systems. The system uses a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information. The results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach.
  • Training-Time Action Conditioning for Efficient Real-Time Chunking: This research proposes a method for training vision-language-action models (VLAs) by simulating inference delay at training time and conditioning on action prefixes directly, which reduces computational overhead and maintains task performance.
  • M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG: This paper introduces a large-scale benchmark for evaluating retrieval-augmented VQA across languages and modalities. The benchmark covers 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs.

Model Highlights

  • FutureMa/Qwen3-8B-Drama-Thinking: This model is a fine-tuned version of Qwen3-8B, specializing in screenwriting with explicit creative reasoning chains. It uses <think>...</think> tags to show internal reasoning, analyzes character psychology, and plans structure. The model was trained on a custom drama thinking dataset and shows significant improvements in output length, thinking depth, and creative reasoning compared to the base model.
  • zai-org/AutoGLM-Phone-9B: This model is a mobile intelligent assistant framework built on AutoGLM, capable of understanding smartphone screens through multimodal perception and executing automated operations to complete tasks. It controls devices via ADB, uses a vision-language model for screen understanding, and leverages intelligent planning to generate and execute action sequences.
  • embedl/Llama-3.2-3B-Instruct-FlashHead: This model is an optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. It is designed for low-latency inference on NVIDIA RTX GPUs and matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks.

Key Takeaways

  • Specialized LLMs: Models are becoming increasingly specialized for creative tasks like scriptwriting, incorporating reasoning and planning capabilities.
  • Mobile Automation: AI agents are being developed to automate tasks on mobile devices, leveraging multimodal perception and intelligent planning.
  • Efficiency Optimizations: Techniques like FlashHead and quantization are being used to improve the efficiency of language models for on-device deployment, maintaining accuracy while reducing latency.
  • RAG Enhancements: Research is focused on improving the accuracy and effectiveness of retrieval-augmented generation systems through entity linking and multilingual capabilities.

AI Papers for 2025-12-09

Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their impressive effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm the presence of an effect of domain mismatch and highlight the importance of domain adaptation and hybrid ranking strategies to enhance factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
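To make the fusion step concrete, here is a minimal Python sketch of reciprocal rank fusion over a semantic ranking and an entity-based ranking; the constant k = 60 is the conventional default (the paper's setting may differ) and the document ids below are illustrative.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of document ids; `rankings` is a list of lists,
    e.g. [semantic_ranking, entity_ranking]. Each document earns 1/(k + rank)
    per ranking it appears in, and documents are reordered by total score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic-similarity ranking with an entity-overlap ranking.
semantic = ["doc3", "doc1", "doc7", "doc2"]
entity = ["doc1", "doc2", "doc3", "doc9"]
print(reciprocal_rank_fusion([semantic, entity]))  # doc1 and doc3 move to the top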

Training-Time Action Conditioning for Efficient Real-Time Chunking

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
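A minimal sketch of the training-time conditioning idea: sample a simulated inference delay per example, treat the first few ground-truth actions as an already-committed prefix to condition on, and take the loss only on the remaining steps. The function name and masking scheme are illustrative, not the authors' implementation.

import torch

def make_rtc_training_batch(action_chunks, max_delay):
    # action_chunks: (batch, horizon, action_dim) ground-truth action chunks.
    batch, horizon, _ = action_chunks.shape
    # Simulate how many actions are already committed when the new chunk
    # arrives, i.e. the inference delay measured in control steps.
    delays = torch.randint(0, max_delay + 1, (batch,))
    prefix_mask = torch.arange(horizon).unsqueeze(0) < delays.unsqueeze(1)  # (batch, horizon)
    # The model is conditioned on this prefix (e.g. concatenated to its inputs)...
    prefixes = action_chunks * prefix_mask.unsqueeze(-1)
    # ...and supervised only on the steps it still has to produce.
    loss_mask = ~prefix_mask
    return prefixes, loss_mask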

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $α$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
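For reference, one standard parameterization of the $α$-divergence family (Amari's convention; the paper may use an equivalent reparameterization) makes the interpolation explicit:

$$ D_\alpha(p \,\|\, q) \;=\; \frac{1}{\alpha(1-\alpha)} \left( 1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \right), $$

which recovers the mode-seeking reverse KL $\mathrm{KL}(q \,\|\, p)$ as $\alpha \to 0$ and the mass-covering forward KL $\mathrm{KL}(p \,\|\, q)$ as $\alpha \to 1$.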

AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.
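A minimal PyTorch sketch of the frequency-fusion idea, enriching spatial features with Fourier-domain amplitude and phase cues; the block below is illustrative and not the paper's actual architecture.

import torch
import torch.nn as nn

class FrequencyFusionBlock(nn.Module):
    # Illustrative: combine a spatial conv branch with a Fourier-domain branch.
    def __init__(self, channels):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.freq_conv = nn.Conv2d(2 * channels, 2 * channels, 1)  # acts on [amplitude, phase]
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        spatial = self.spatial_conv(x)
        freq = torch.fft.rfft2(x, norm="ortho")
        amp, phase = freq.abs(), freq.angle()
        amp, phase = self.freq_conv(torch.cat([amp, phase], dim=1)).chunk(2, dim=1)
        freq_feat = torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
        return self.fuse(torch.cat([spatial, freq_feat], dim=1))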

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair attribution in generative search pipelines that use retrieval-augmented generation (RAG). MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions with linear computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens--for instance, it gives up to an 8x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy.
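To see why a decomposable max-style utility makes exact attribution cheap, here is a sketch of the closed-form Shapley values for the per-facet utility u(S) = max of v_i over documents in S (with u(∅) = 0 and non-negative scores); for a max-sum utility the per-facet attributions are simply summed. The relevance scores below are made up, and the paper's MaxShapley algorithm may differ in detail.

def max_shapley(values):
    """Exact Shapley attributions for u(S) = max_{i in S} v_i in O(n log n):
    each gap between consecutive sorted scores is shared equally among the
    documents whose score is at least that large."""
    n = len(values)
    order = sorted(range(n), key=lambda i: -values[i])
    v = [values[i] for i in order] + [0.0]  # descending, padded with 0
    phi = [0.0] * n
    suffix = 0.0
    for j in range(n, 0, -1):                # j = 1-based sorted position
        suffix += (v[j - 1] - v[j]) / j
        phi[order[j - 1]] = suffix
    return phi

# Three retrieved documents with relevance 0.9, 0.5, 0.2 for one facet:
print(max_shapley([0.9, 0.5, 0.2]))  # ~[0.617, 0.217, 0.067]; sums to 0.9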

SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
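To illustrate the parameterized-problem-with-executable-ground-truth design, here is a hypothetical example in the same spirit (not drawn from the benchmark):

import math

def projectile_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Ground-truth solver for a parameterized kinematics question:
    'A projectile is launched at speed v0 and angle theta on level ground;
    find its horizontal range.' Any (v0, theta) instantiation gives a fresh
    numerical answer, so accuracy and consistency can be measured across variants."""
    theta = math.radians(theta_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

print(projectile_range(20.0, 45.0))  # ≈ 40.8 m
print(projectile_range(15.0, 30.0))  # ≈ 19.9 m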

Trusted AI Agents in the Cloud

AI agents powered by large language models are increasingly deployed as cloud services that autonomously access sensitive data, invoke external tools, and interact with other agents. However, these agents run within a complex multi-party ecosystem, where untrusted components can lead to data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per binary protection and offer no guarantees for cross-principal trust, accelerator-level isolation, or supervised agent behavior. We present Omega, a system that enables trusted AI agents by enforcing end-to-end isolation, establishing verifiable trust across all contributing principals, and supervising every external interaction with accountable provenance. Omega builds on Confidential VMs and Confidential GPUs to create a Trusted Agent Platform that hosts many agents within a single CVM using nested isolation. It also provides efficient multi-agent orchestration with cross-principal trust establishment via differential attestation, and a policy specification and enforcement framework that governs data access, tool usage, and inter-agent communication for data protection and regulatory compliance. Implemented on AMD SEV-SNP and NVIDIA H100, Omega fully secures agent state across CVM-GPU, and achieves high performance while enabling high-density, policy-compliant multi-agent deployments at cloud scale.

Impugan: Learning Conditional Generative Models for Robust Data Imputation

Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82\% lower Earth Mover's Distance (EMD) and 70\% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: github.com/zalishmahmud/impuganBigData2025
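A minimal PyTorch sketch of the conditional-GAN imputation setup described above, assuming tabular data with a binary observation mask; the layer sizes and mask conditioning are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Reconstructs missing entries from the observed features plus the mask.
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x, mask):
        # mask is 1 where a value is observed, 0 where it is missing.
        filled = self.net(torch.cat([x * mask, mask], dim=1))
        return x * mask + filled * (1 - mask)  # keep observed values, impute the rest

class Discriminator(nn.Module):
    # Scores whether a (possibly imputed) row looks like a real complete sample.
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)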

Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/qtrl/.

AI Models

FutureMa/Qwen3-8B-Drama-Thinking


license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
  • qwen3
  • thinking
  • creative-writing
  • screenwriting
  • drama
  • chain-of-thought
  • reasoning
  • ms-swift
  • full-parameter-finetuning
datasets:
  • custom-drama-thinking-dataset
language:
  • en
  • zh
library_name: transformers
pipeline_tag: text-generation
model-index:
  • name: Qwen3-8B-Drama-Thinking
    results:
      • task: text-generation (Creative Script Writing)
        metrics:
          • Thinking Depth Score: 9.0
          • Script Format Score: 9.0
          • Dramatic Craft Score: 8.5

Qwen3-8B-Drama-Thinking

This model is a full parameter fine-tuned version of Qwen/Qwen3-8B on a custom drama thinking dataset with explicit creative reasoning chains.

Model Description

  • Base Model: Qwen3-8B (8 billion parameters)
  • Training Method: Full Parameter Fine-tuning (NOT LoRA)
  • Training Framework: ms-swift
  • Training Data: Custom Drama Thinking Dataset (6,319 samples, avg ~5,000 tokens)
  • Specialization: Screenwriting with explicit <think>...</think> creative reasoning
  • Hardware: 2x NVIDIA H100 80GB SXM5
  • Training Time: 2 hours 46 minutes (3 epochs)
  • Training Cost: ~$17.86

Key Features

🎬 Professional Screenwriting Assistant

This model generates dramatic scripts with explicit creative deliberation:

  • Thinking Process Visible: Uses <think>...</think> tags to show internal reasoning
  • Deep Character Psychology: Analyzes motivations, defense mechanisms, subtext
  • Structural Planning: Three-act structure, emotional arcs, pacing decisions
  • Visual Storytelling: Symbolism, atmosphere, cinematographic choices
  • Professional Format: Correct screenplay formatting (scene headers, action lines, dialogue)

📊 Performance Comparison

Compared to base Qwen3-8B:

| Metric | Base Model | Fine-Tuned | Improvement |
|--------|------------|------------|-------------|
| Output Length | 1,071 tokens | 3,874 tokens | +262% |
| Thinking Depth | 5/10 | 9/10 | +80% |
| Creative Reasoning | 500 tokens | 3,400 tokens | +580% |
| Craft Analysis | Generic | Professional | Qualitative leap |

🎯 Unique Value Proposition

This is not just a text generator - it's a creative thinking partner that externalizes the entire screenwriting process: from title analysis to character psychology to structural planning to final execution.

Training Details

Training Configuration

Model:              Qwen/Qwen3-8B
Template:           qwen3_thinking
Training Type:      Full Parameter (all 8B parameters)
Max Length:         8192 tokens (for long thinking chains)
Batch Size:         1 per device × 2 GPUs
Gradient Accum:     8 steps (effective batch size: 16)
Learning Rate:      1e-5
Epochs:             3
Optimization:       DeepSpeed Zero3 + Gradient Checkpointing
                    Liger Kernel, BF16 mixed precision
Loss Scale:         ignore_empty_think
GPU Memory:         ~74.62 GB per H100 (stable)

Dataset Characteristics

  • Samples: 6,319 dramatic script continuations
  • Average Length: ~5,000 tokens per sample
  • Max Length: ~6,100 tokens
  • Format: Conversations with <think>...</think> reasoning tags (see the sample sketch after this list)
  • Content:
    • Script opening scenes (title, description, initial dialogue)
    • Extensive creative deliberation (3,000+ tokens of thinking)
    • Script continuation with proper formatting
  • Style: Dramatic, emotionally intense scenarios (conflicts, reconciliation, tragedy)
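A minimal sketch of what a single training sample in this format might look like; the field names and content are illustrative, not the actual dataset schema.

sample = {
    "messages": [
        {"role": "user", "content": "Title: The Reunion\n\nINT. FAMILY LIVING ROOM - DAY\n\nSARAH (35) stands by the window..."},
        {"role": "assistant", "content": (
            "<think>\nThe title points at estrangement, so the first beat has to "
            "establish distance before anyone speaks...\n</think>\n\n"
            "SARAH\n(not turning around)\nYou kept the garden.\n..."
        )},
    ]
}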

Training Metrics

  • Final Loss: 0.844
  • Average Loss: 0.978
  • Loss Trajectory: 1.602 (start) → 0.82-0.83 (end)
  • Training Speed: ~8 seconds/iteration
  • Total Steps: 1,185
  • Checkpoints: 5 saved (400, 800, 900, 1000, 1185)

Usage

Quick Start (ms-swift)

# Install ms-swift
pip install ms-swift

# Inference (interactive mode)
swift infer \
    --ckpt_dir FutureMa/Qwen3-8B-Drama-Thinking \
    --template qwen3_thinking \
    --max_new_tokens 4096 \
    --temperature 0.7

Python API

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import PtEngine, InferRequest, RequestConfig

# Initialize engine
engine = PtEngine(
    model_id_or_path="FutureMa/Qwen3-8B-Drama-Thinking",
    max_batch_size=1,
    torch_dtype="bfloat16"
)

# Create prompt
prompt = """Title: The Reunion
Description: Two estranged siblings meet at their childhood home after 20 years.

INT. FAMILY LIVING ROOM - DAY

SARAH (35) stands by the window, looking out at the garden.
MICHAEL (38) enters, hesitant."""

# Generate (streaming)
messages = [{'role': 'user', 'content': prompt}]
request = InferRequest(messages=messages)
config = RequestConfig(max_tokens=4096, temperature=0.7, stream=True)

for response in engine.infer([request], config)[0]:
    if response:
        print(response.choices[0].delta.content, end='', flush=True)

Using with Transformers (requires adapting checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "FutureMa/Qwen3-8B-Drama-Thinking",
    torch_dtype="bfloat16",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "system", "content": "You are a creative screenwriter assistant specialized in dramatic storytelling with internal reasoning."},
    {"role": "user", "content": "Write a scene about forgiveness..."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Output

Input Prompt:

Title: The Last Apology
Description: A daughter arrives at her estranged father's deathbed...

Model Output (excerpt):

<think>
Okay. Let's go. I'm sitting here with the opening of *The Last Apology* fresh in my mind,
and I need to figure out exactly where this goes next—not just the next scene, but the next
emotional beat, the next structural pivot...

First, the title: *The Last Apology*. That's not just poetic—it's structural. It tells me
this story is about delayed recognition, about the finality of words left unsaid...

[3,400 tokens of deep creative analysis including:]
- Title deconstruction and thematic implications
- Character psychology analysis
- Three-act structure planning
- Visual language and symbolism
- Multiple narrative paths considered
- Professional screenwriting techniques
</think>

INT. HOSPITAL ROOM - NIGHT

ANNA (28), in a wrinkled business suit, hesitates at the doorway.

DAVID (65) lies in bed, breathing labored...

[Script continues with proper formatting]

Intended Use

✅ Recommended Use Cases

  1. Screenwriting Education: Learn professional creative thinking process
  2. Script Ideation: Generate story frameworks and narrative alternatives
  3. Story Consulting: Explore "what if" scenarios with explicit reasoning
  4. Creative Brainstorming: Understand decision-making in storytelling
  5. Draft Development: Plan structure before execution

❌ Not Recommended For

  1. Final Shooting Scripts: Requires human refinement for production
  2. Comedy/Action Genres: Training bias toward dramatic content
  3. Long-form Series: Single-pass generation may lack consistency
  4. Immediate Production: Dialogue needs naturalization

Evaluation Results

Quantitative Metrics (vs. Base Model)

| Aspect | Score | Base Model | Improvement |
|--------|-------|------------|-------------|
| Thinking Depth | 9/10 | 5/10 | +80% |
| Script Format | 9/10 | 8/10 | +13% |
| Dramatic Craft | 8.5/10 | 8/10 | +6% |
| Character Psychology | 9/10 | 6/10 | +50% |
| Decision Transparency | 9/10 | 5/10 | +80% |
| Overall | 8.1/10 | 6.9/10 | +17% |

Qualitative Improvements

  • Professional Voice: Sounds like an experienced screenwriter
  • Structural Thinking: Explicit three-act planning
  • Meta-Awareness: "This isn't just a script. It's a reckoning."
  • Non-Linear Reasoning: Considers alternatives, backtracks, refines
  • Craft-Oriented: Explains why choices serve the story

Limitations

  1. Thinking Verbosity: Generates ~3,400 tokens of thinking (87% of output)

    • May be excessive for quick tasks
    • Consider using max_new_tokens to limit length
  2. Incomplete Execution: Token budget consumed by thinking

    • Many planned scenes not fully generated
    • May need 6,000-8,000 token limit for complete scripts
  3. Dialogue Naturalness: More direct/literary than conversational

    • Training data style influences output
    • May need post-processing for natural speech
  4. Training Data Bias: Skews toward melodramatic scenarios

    • Less suited for subtle/realistic dialogue
    • Best for emotionally intense stories

Training Insights

What Made This Successful

  1. 8192 Token Context: Essential for capturing full thinking chains

    • Initial assumption of 2048 would have truncated data
    • Average sample length: ~5,000 tokens
  2. DeepSpeed Zero3: Required (not optional)

    • Single H100: Would need ~109-114 GB (OOM)
    • Zero3 sharding: ~74.62 GB per card ✅
  3. Full Parameter Training: Worth the cost

    • Deeper capability transfer than LoRA
    • Better thinking process internalization
    • Cost: $17.86 (2.8 hours) vs ~$5 for LoRA
  4. Quality Training Data: 6,319 long-form reasoning examples

    • Actual creative process in <think> tags
    • High-quality dramatic writing

Citation

@misc{qwen3-drama-thinking-2025,
  author = {FutureMa},
  title = {Qwen3-8B-Drama-Thinking: Full Parameter Fine-tuning for Creative Screenwriting},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen3-8B-Drama-Thinking}},
  note = {Full parameter fine-tuning on 6,319 drama samples with explicit reasoning chains}
}

Acknowledgments

  • Base Model: Qwen Team - Qwen3-8B
  • Training Framework: ms-swift - ModelScope SWIFT
  • Infrastructure: Lambda Cloud - 2x H100 80GB SXM5
  • Dataset: Custom Drama Thinking Dataset (6,319 samples)

Model Card Contact

For questions or feedback:

  • HuggingFace: @FutureMa
  • GitHub Issues: Report via ms-swift repository

Training Date: 2025-12-08 Training Duration: 2h 46m Model Size: ~16GB (BF16 precision) Recommended VRAM: 16GB+ for inference

Author: FutureMa

Likes: 12

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, thinking, creative-writing, screenwriting, drama, chain-of-thought, reasoning, ms-swift, full-parameter-finetuning, conversational, en, zh, dataset:custom-drama-thinking-dataset, base_model:Qwen/Qwen3-8B, base_model:finetune:Qwen/Qwen3-8B, license:apache-2.0, model-index, text-generation-inference, endpoints_compatible, region:us

mradermacher/GLM-4.6V-Flash-GGUF


base_model: zai-org/GLM-4.6V-Flash
language:
  • zh
  • en
library_name: transformers
license: mit
mradermacher:
  readme_rev: 1
quantized_by: mradermacher

About


static quants of https://huggingface.co/zai-org/GLM-4.6V-Flash


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants are available at https://huggingface.co/mradermacher/GLM-4.6V-Flash-i1-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.
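If you prefer a Python route, here is a hedged sketch using llama-cpp-python (not mentioned in this card); the filename is hypothetical (substitute whichever quant you download from the table below), and support for this model architecture depends on your llama.cpp build.

from llama_cpp import Llama

# Hypothetical local filename for one of the quants listed below.
llm = Llama(model_path="GLM-4.6V-Flash.Q4_K_M.gguf", n_ctx=4096)
out = llm("Summarize retrieval-augmented generation in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])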

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 4.1 | |
| GGUF | Q3_K_S | 4.7 | |
| GGUF | Q3_K_M | 5.1 | lower quality |
| GGUF | Q3_K_L | 5.3 | |
| GGUF | IQ4_XS | 5.4 | |
| GGUF | Q4_K_S | 5.9 | fast, recommended |
| GGUF | Q4_K_M | 6.3 | fast, recommended |
| GGUF | Q5_K_S | 6.8 | |
| GGUF | Q5_K_M | 7.2 | |
| GGUF | Q6_K | 8.4 | very good quality |
| GGUF | Q8_0 | 10.1 | fast, best quality |
| GGUF | f16 | 18.9 | 16 bpw, overkill |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 11

Downloads: 0

Tags: transformers, gguf, zh, en, base_model:zai-org/GLM-4.6V-Flash, base_model:quantized:zai-org/GLM-4.6V-Flash, license:mit, endpoints_compatible, region:us, conversational

zai-org/AutoGLM-Phone-9B


license: mit
language:
  • zh
base_model:
  • zai-org/GLM-4.1V-9B-Base
pipeline_tag: image-text-to-text
tags:
  • agent

AutoGLM-Phone-9B

👋 Join our WeChat community: https://raw.githubusercontent.com/zai-org/Open-AutoGLM/refs/heads/main/resources/WECHAT.md

⚠️ This project is intended for research and educational purposes only.
Any use for illegal data access, system interference, or unlawful activities is strictly prohibited.
Please review our Terms of Use carefully.

Project Overview

Phone Agent is a mobile intelligent assistant framework built on AutoGLM, capable of understanding smartphone screens through multimodal perception and executing automated operations to complete tasks.
The system controls devices via ADB (Android Debug Bridge), uses a vision-language model for screen understanding, and leverages intelligent planning to generate and execute action sequences.

Users can simply describe tasks in natural language—for example, “Open Xiaohongshu and search for food recommendations.”
Phone Agent will automatically parse the intent, understand the current UI, plan the next steps, and carry out the entire workflow.

The system also includes:

  • Sensitive action confirmation mechanisms
  • Human-in-the-loop fallback for login or verification code scenarios
  • Remote ADB debugging, allowing device connection via WiFi or network for flexible remote control and development
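To make the perceive-plan-act loop concrete, here is a hedged Python sketch; the ADB commands are standard, while plan_next_action stands in for the AutoGLM model call and is hypothetical (see the GitHub guide for the actual deployment API).

import subprocess

def screenshot_png() -> bytes:
    # Capture the current screen as PNG bytes over ADB.
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout

def tap(x: int, y: int) -> None:
    # Send a tap at pixel coordinates (x, y).
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def run_task(task: str, plan_next_action, max_steps: int = 20) -> None:
    # plan_next_action(task, screenshot) -> {"type": "tap"/"finish", ...} is hypothetical.
    for _ in range(max_steps):
        action = plan_next_action(task, screenshot_png())
        if action["type"] == "finish":
            break
        if action["type"] == "tap":
            tap(action["x"], action["y"])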

Model Usage

We provide an open-source model usage guide to help you quickly download and deploy the model.
Please visit our GitHub for detailed instructions.

  • The model architecture is identical to GLM-4.1V-9B-Thinking.
    For deployment details, see the GLM-V repository.

Citation

If you find our work helpful, please cite the following paper:

@article{liu2024autoglm,
  title={Autoglm: Autonomous foundation agents for guis},
  author={Liu, Xiao and Qin, Bo and Liang, Dongzhu and Dong, Guang and Lai, Hanyu and Zhang, Hanchen and Zhao, Hanlin and Iong, Iat Long and Sun, Jiadai and Wang, Jiaqi and others},
  journal={arXiv preprint arXiv:2411.00820},
  year={2024}
}

Author: zai-org

Likes: 9

Downloads: 0

Tags: safetensors, glm4v, agent, image-text-to-text, conversational, zh, arxiv:2411.00820, base_model:zai-org/GLM-4.1V-9B-Base, base_model:finetune:zai-org/GLM-4.1V-9B-Base, license:mit, region:us

turboderp/GLM-4.6V-exl3


license: mit
base_model: zai-org/GLM-4.6V
base_model_relation: quantized
quantized_by: turboderp
tags:
  • exl3

EXL3 quants of GLM-4.6V

⚠️ Requires ExLlamaV3 v0.0.18 (or v0.0.17 dev branch)

Base bitrates:

4.00 bits per weight
(more to come)

Author: turboderp

Likes: 3

Downloads: 0

Tags: exl3, base_model:zai-org/GLM-4.6V, base_model:quantized:zai-org/GLM-4.6V, license:mit, region:us

embedl/Llama-3.2-3B-Instruct-FlashHead


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-3B-Instruct
tags:
  • text-generation-inference

Llama-3.2-3B-Instruct-FlashHead


Optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Custom vLLM generation via embedl-models

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |

FlashHead improves end-to-end speed by 1.26× over state-of-the-art, while maintaining full accuracy parity.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|--------|----------|--------|-----|------------|-------|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
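If the 131072-token context above does not fit on your GPU, a reduced configuration per the note in this section might look as follows; whether the embedl wrapper forwards gpu_memory_utilization to vLLM is an assumption here.

from embedl.models.vllm import LLM

llm = LLM(
    model="embedl/Llama-3.2-3B-Instruct-FlashHead",
    trust_remote_code=True,
    max_model_len=8192,            # shorter context, much smaller KV cache
    gpu_memory_utilization=0.90,   # assumed to be passed through to vLLM
)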

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Huggingface transformers generation
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-3B-Instruct, base_model:finetune:meta-llama/Llama-3.2-3B-Instruct, license:other, region:us

embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-3B-Instruct
tags:
  • text-generation-inference

Llama-3.2-3B-Instruct-FlashHead-W4A16


Optimized version of Llama-3.2-3B-Instruct using Quantization and FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint and latency.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |

FlashHead improves end-to-end speed by 1.26× over state-of-the-art, while maintaining full accuracy parity.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|--------|----------|--------|-----|------------|-------|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Huggingface transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-3B-Instruct, base_model:quantized:meta-llama/Llama-3.2-3B-Instruct, license:other, compressed-tensors, region:us

embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-1B-Instruct
tags:
  • text-generation-inference

Llama-3.2-1B-Instruct-FlashHead-W4A16


Optimized version of Llama-3.2-1B-Instruct using Quantization and FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint and latency.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

FlashHead improves end-to-end speed by 1.75× over state-of-the-art, while maintaining full accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

NVIDIA H200 measurement: FP8, 512 Tokens/sec.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|--------|----------|-----------|--------|-------|-----|------------|-------|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Huggingface transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-1B-Instruct, base_model:quantized:meta-llama/Llama-3.2-1B-Instruct, license:other, compressed-tensors, region:us

embedl/Llama-3.2-1B-Instruct-FlashHead


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-1B-Instruct
tags:
  • text-generation-inference

Llama-3.2-1B-Instruct-FlashHead


Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

FlashHead improves end-to-end speed by 1.75× over state-of-the-art, while maintaining full accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

NVIDIA H200 measurement: FP8, 512 Tokens/sec.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|--------|----------|-----------|--------|-------|-----|------------|-------|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Huggingface transformers generation
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-1B-Instruct, base_model:finetune:meta-llama/Llama-3.2-1B-Instruct, license:other, region:us

QuixiAI/INTELLECT-3V

INTELLECT-3-V

A vision-language model created by grafting the language model weights from INTELLECT-3 into the GLM-4.6V architecture.

Motivation

INTELLECT-3 is a strong open-source language model, but lacks vision capabilities. GLM-4.6V is a vision-language model with an identical language model architecture. By replacing GLM-4.6V's language model weights with INTELLECT-3's weights while preserving the vision encoder and projection layers, we create a vision-language model powered by INTELLECT-3.

Architecture

Both models share the same language model backbone:

  • 46 transformer layers (layer 0 is dense MLP, layers 1-45 are MoE)
  • 4096 hidden dimension
  • 128 routed experts + shared experts per MoE layer
  • Grouped Query Attention (12288 q_proj, 1024 k/v_proj)
  • 151552 vocabulary size
  • BF16 weights

GLM-4.6V additionally includes:

  • 24-layer vision transformer (1536 hidden dim)
  • Visual merger projecting vision features to LLM hidden dimension
  • Downsampling convolution for spatial compression
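
Because the graft only works if the two language-model backbones are tensor-for-tensor identical, a quick sanity check is to compare the shapes of corresponding keys across the two checkpoints. A minimal sketch, assuming locally downloaded sharded safetensors checkpoints at the paths used by the creation script further down; the prefix handling mirrors the key listings under Source Model Architectures:

import glob, os
from safetensors import safe_open

def shard_shapes(model_dir, strip_prefix):
    # Collect {key: shape} across all *.safetensors shards, stripping the given
    # prefix so the two naming schemes line up ("model." vs "model.language_model.").
    shapes = {}
    for shard in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                shapes[key.replace(strip_prefix, "", 1)] = tuple(f.get_slice(key).get_shape())
    return shapes

if __name__ == "__main__":
    intellect = shard_shapes(os.path.expanduser("~/models/INTELLECT-3"), "model.")
    glm = shard_shapes(os.path.expanduser("~/models/GLM-4.6V"), "model.language_model.")

    # Only the grafted backbone tensors need to match: the 46 layers and the final norm.
    backbone = [k for k in intellect if k.startswith("layers.") or k == "norm.weight"]
    mismatches = [k for k in backbone if glm.get(k) != intellect[k]]
    print(f"checked {len(backbone)} backbone tensors, {len(mismatches)} shape mismatches")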

What Was Grafted

The following weights were copied from INTELLECT-3 to GLM-4.6V:

| INTELLECT-3       | GLM-4.6V                         |
|-------------------|----------------------------------|
| model.layers.*    | model.language_model.layers.*    |
| model.norm.weight | model.language_model.norm.weight |

What Was Preserved (from GLM-4.6V)

  • model.language_model.embed_tokens.weight — kept to maintain vision token compatibility
  • lm_head.weight — kept aligned with embed_tokens
  • model.visual.* — entire vision encoder and merger preserved
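
Put together, the graft copies every model.layers.* tensor plus model.norm.weight from INTELLECT-3 into the matching model.language_model.* slot and leaves embed_tokens, lm_head, and model.visual.* untouched. A minimal sketch of that key remapping over plain state dicts (not the released graft_intellect3_to_glm.py, which may instead work shard by shard):

def remap_key(intellect_key):
    # INTELLECT-3 -> GLM-4.6V key mapping for the grafted tensors;
    # returns None for tensors that are intentionally NOT copied
    # (embed_tokens, lm_head, and everything under model.visual.*).
    if intellect_key.startswith("model.layers."):
        return intellect_key.replace("model.layers.", "model.language_model.layers.", 1)
    if intellect_key == "model.norm.weight":
        return "model.language_model.norm.weight"
    return None

def graft(glm_state, intellect_state):
    # Overwrite GLM-4.6V backbone weights in place; the vision tower, embeddings,
    # and lm_head keep their original GLM-4.6V values.
    copied = 0
    for src_key, tensor in intellect_state.items():
        dst_key = remap_key(src_key)
        if dst_key is not None:
            assert glm_state[dst_key].shape == tensor.shape, dst_key
            glm_state[dst_key] = tensor
            copied += 1
    return copied

# Usage idea: graft(glm_state_dict, intellect_state_dict), then save the
# patched GLM-4.6V checkpoint as INTELLECT-3-V.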

Rationale

Why replace the final norm? The RMSNorm after the last transformer layer is tightly coupled to the layer outputs it normalizes: INTELLECT-3's norm was trained end-to-end with its layers and learned to normalize their specific output distribution, so it is grafted along with them.

Why keep embed_tokens? The vision merger projects visual features into the same embedding space as text tokens. Replacing embed_tokens could break the alignment between text and vision embeddings. Additionally, lm_head is often tied or co-trained with embed_tokens.

Why not replace lm_head? Same reasoning — keeping lm_head and embed_tokens together maintains their learned relationship.

Known Limitations

  1. Embedding space mismatch: INTELLECT-3's layers learned representations in a potentially different embedding space than GLM-4.6V. This may cause some degradation in both language and vision-language performance.

  2. Vision-language alignment: The visual merger was trained to project into GLM-4.6V's representation space. INTELLECT-3 may have learned different internal representations, potentially affecting vision-language tasks.

  3. Tokenizer compatibility: While both models have the same vocabulary size (151552), verify tokenizer compatibility for your use case.
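
For limitation 3, a quick check is to load both source tokenizers and compare their vocabularies and the token ids they produce for a sample string. A minimal sketch, assuming local copies at the same paths used by the creation script below:

import os
from transformers import AutoTokenizer

# Local paths match the creation-script example below; substitute your own copies.
tok_intellect = AutoTokenizer.from_pretrained(os.path.expanduser("~/models/INTELLECT-3"), trust_remote_code=True)
tok_glm = AutoTokenizer.from_pretrained(os.path.expanduser("~/models/GLM-4.6V"), trust_remote_code=True)

print("same vocab size :", tok_intellect.vocab_size == tok_glm.vocab_size)
print("identical vocab :", tok_intellect.get_vocab() == tok_glm.get_vocab())

sample = "Describe this image in one sentence."
print("same token ids  :", tok_intellect.encode(sample) == tok_glm.encode(sample))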

Creation Script

The model was created using graft_intellect3_to_glm.py:

python graft_intellect3_to_glm.py \
    --intellect3 ~/models/INTELLECT-3 \
    --glm ~/models/GLM-4.6V \
    --output ~/models/INTELLECT-3-V

Source Model Architectures

INTELLECT-3

lm_head.weight,[151552,4096],BF16
model.embed_tokens.weight,[151552,4096],BF16
model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.norm.weight,[4096],BF16

GLM-4.6V

lm_head.weight,[151552,4096],BF16
model.language_model.embed_tokens.weight,[151552,4096],BF16
model.language_model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.language_model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.language_model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.language_model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.language_model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.language_model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.language_model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.language_model.norm.weight,[4096],BF16
model.visual.blocks.[0-23].attn.proj.weight,[1536,1536],BF16
model.visual.blocks.[0-23].attn.qkv.weight,[4608,1536],BF16
model.visual.blocks.[0-23].mlp.down_proj.weight,[1536,4096],BF16
model.visual.blocks.[0-23].mlp.gate_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].mlp.up_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].norm[1-2].weight,[1536],BF16
model.visual.downsample.bias,[4096],BF16
model.visual.downsample.weight,[4096,1536,2,2],BF16
model.visual.embeddings.position_embedding.weight,[576,1536],BF16
model.visual.merger.down_proj.weight,[4096,10944],BF16
model.visual.merger.gate_proj.weight,[10944,4096],BF16
model.visual.merger.post_projection_norm.bias,[4096],BF16
model.visual.merger.post_projection_norm.weight,[4096],BF16
model.visual.merger.proj.weight,[4096,4096],BF16
model.visual.merger.up_proj.weight,[10944,4096],BF16
model.visual.patch_embed.proj.bias,[1536],BF16
model.visual.patch_embed.proj.weight,[1536,3,2,14,14],BF16
model.visual.post_conv_layernorm.weight,[1536],BF16
model.visual.post_layernorm.weight,[1536],BF16

License

Please refer to the licenses of the source models:

  • INTELLECT-3
  • zai-org/GLM-4.6V

Acknowledgments

Author: QuixiAI

Likes: 2

Downloads: 0

Tags: safetensors, glm4v_moe, region:us

AliceThirty/GLM-4.6V-gguf



This is an experiment: llama.cpp does not yet support GLM-4.5V or GLM-4.6V, so I made llama.cpp treat this model as the GLM-4.5-Air architecture (which means this GGUF can only process text). It seems to have worked.
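
The card does not describe how the converter was convinced; one plausible approach (an assumption, not the author's documented procedure) is to rewrite config.json so that llama.cpp's convert_hf_to_gguf.py sees a text-only GLM-4.5-Air-style model. The architecture and model_type strings below are placeholders and must match whatever llama.cpp actually recognizes for GLM-4.5-Air; depending on the converter, the model.visual.* tensors may also need to be skipped.

import json, shutil

SRC = "GLM-4.6V"          # local checkout of zai-org/GLM-4.6V
DST = "GLM-4.6V-as-air"   # patched copy for the converter to read

shutil.copytree(SRC, DST)
with open(f"{DST}/config.json") as f:
    cfg = json.load(f)

# Pretend to be the text-only GLM-4.5-Air family. These strings are assumptions;
# use whatever llama.cpp's converter actually expects.
cfg["architectures"] = ["Glm4MoeForCausalLM"]
cfg["model_type"] = "glm4_moe"
cfg.pop("vision_config", None)  # drop the vision config -> text-only GGUF

with open(f"{DST}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Then convert as usual, e.g.:
#   python convert_hf_to_gguf.py GLM-4.6V-as-air --outfile GLM-4.6V-text.gguf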

Author: AliceThirty

Likes: 2

Downloads: 0

Tags: gguf, base_model:zai-org/GLM-4.6V, base_model:quantized:zai-org/GLM-4.6V, endpoints_compatible, region:us, conversational