Today's AI Summary

AI Developments: Scriptwriting LLMs, Mobile Agents, and Efficiency Optimizations

Today's AI landscape features advancements in language models tailored for creative tasks, mobile automation, and efficiency improvements for on-device deployment. Research focuses on enhancing retrieval-augmented generation and improving reasoning capabilities.

Noteworthy Papers

  • Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms: This paper introduces an enhanced RAG architecture that integrates Entity Linking to improve the accuracy of educational question-answering systems. The system uses a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information. The results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach.
  • Training-Time Action Conditioning for Efficient Real-Time Chunking: This research proposes a method for training vision-language-action models (VLAs) by simulating inference delay at training time and conditioning on action prefixes directly, which reduces computational overhead and maintains task performance.
  • M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG: This paper introduces a large-scale benchmark for evaluating retrieval-augmented VQA across languages and modalities. The benchmark covers 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs.

Model Highlights

  • FutureMa/Qwen3-8B-Drama-Thinking: This model is a fine-tuned version of Qwen3-8B, specializing in screenwriting with explicit creative reasoning chains. It uses <think>...</think> tags to show internal reasoning, analyzes character psychology, and plans structure. The model was trained on a custom drama thinking dataset and shows significant improvements in output length, thinking depth, and creative reasoning compared to the base model.
  • zai-org/AutoGLM-Phone-9B: This model is a mobile intelligent assistant framework built on AutoGLM, capable of understanding smartphone screens through multimodal perception and executing automated operations to complete tasks. It controls devices via ADB, uses a vision-language model for screen understanding, and leverages intelligent planning to generate and execute action sequences.
  • embedl/Llama-3.2-3B-Instruct-FlashHead: This model is an optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. It is designed for low-latency inference on NVIDIA RTX GPUs and matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks.

Key Takeaways

  • Specialized LLMs: Models are becoming increasingly specialized for creative tasks like scriptwriting, incorporating reasoning and planning capabilities.
  • Mobile Automation: AI agents are being developed to automate tasks on mobile devices, leveraging multimodal perception and intelligent planning.
  • Efficiency Optimizations: Techniques like FlashHead and quantization are being used to improve the efficiency of language models for on-device deployment, maintaining accuracy while reducing latency.
  • RAG Enhancements: Research is focused on improving the accuracy and effectiveness of retrieval-augmented generation systems through entity linking and multilingual capabilities.

AI Papers for 2025-12-09

Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their impressive effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm the presence of an effect of domain mismatch and highlight the importance of domain adaptation and hybrid ranking strategies to enhance factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
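To make the fusion step concrete, here is a minimal Python sketch of reciprocal rank fusion over a semantic ranking and an entity-based ranking; the constant k = 60 is the conventional default (the paper's setting may differ) and the document ids below are illustrative.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of document ids; `rankings` is a list of lists,
    e.g. [semantic_ranking, entity_ranking]. Each document earns 1/(k + rank)
    per ranking it appears in, and documents are reordered by total score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic-similarity ranking with an entity-overlap ranking.
semantic = ["doc3", "doc1", "doc7", "doc2"]
entity = ["doc1", "doc2", "doc3", "doc9"]
print(reciprocal_rank_fusion([semantic, entity]))  # doc1 and doc3 move to the top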

Training-Time Action Conditioning for Efficient Real-Time Chunking

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
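A minimal sketch of the training-time conditioning idea: sample a simulated inference delay per example, treat the first few ground-truth actions as an already-committed prefix to condition on, and take the loss only on the remaining steps. The function name and masking scheme are illustrative, not the authors' implementation.

import torch

def make_rtc_training_batch(action_chunks, max_delay):
    # action_chunks: (batch, horizon, action_dim) ground-truth action chunks.
    batch, horizon, _ = action_chunks.shape
    # Simulate how many actions are already committed when the new chunk
    # arrives, i.e. the inference delay measured in control steps.
    delays = torch.randint(0, max_delay + 1, (batch,))
    prefix_mask = torch.arange(horizon).unsqueeze(0) < delays.unsqueeze(1)  # (batch, horizon)
    # The model is conditioned on this prefix (e.g. concatenated to its inputs)...
    prefixes = action_chunks * prefix_mask.unsqueeze(-1)
    # ...and supervised only on the steps it still has to produce.
    loss_mask = ~prefix_mask
    return prefixes, loss_mask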

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $α$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
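For reference, one standard parameterization of the $α$-divergence family (Amari's convention; the paper may use an equivalent reparameterization) makes the interpolation explicit:

$$ D_\alpha(p \,\|\, q) \;=\; \frac{1}{\alpha(1-\alpha)} \left( 1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \right), $$

which recovers the mode-seeking reverse KL $\mathrm{KL}(q \,\|\, p)$ as $\alpha \to 0$ and the mass-covering forward KL $\mathrm{KL}(p \,\|\, q)$ as $\alpha \to 1$.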

AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.
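A minimal PyTorch sketch of the frequency-fusion idea, enriching spatial features with Fourier-domain amplitude and phase cues; the block below is illustrative and not the paper's actual architecture.

import torch
import torch.nn as nn

class FrequencyFusionBlock(nn.Module):
    # Illustrative: combine a spatial conv branch with a Fourier-domain branch.
    def __init__(self, channels):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.freq_conv = nn.Conv2d(2 * channels, 2 * channels, 1)  # acts on [amplitude, phase]
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        spatial = self.spatial_conv(x)
        freq = torch.fft.rfft2(x, norm="ortho")
        amp, phase = freq.abs(), freq.angle()
        amp, phase = self.freq_conv(torch.cat([amp, phase], dim=1)).chunk(2, dim=1)
        freq_feat = torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
        return self.fuse(torch.cat([spatial, freq_feat], dim=1))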

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair attribution in generative search pipelines that use retrieval-augmented generation (RAG). MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions with linear computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens--for instance, it gives up to an 8x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy.
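To see why a decomposable max-style utility makes exact attribution cheap, here is a sketch of the closed-form Shapley values for the per-facet utility u(S) = max of v_i over documents in S (with u(∅) = 0 and non-negative scores); for a max-sum utility the per-facet attributions are simply summed. The relevance scores below are made up, and the paper's MaxShapley algorithm may differ in detail.

def max_shapley(values):
    """Exact Shapley attributions for u(S) = max_{i in S} v_i in O(n log n):
    each gap between consecutive sorted scores is shared equally among the
    documents whose score is at least that large."""
    n = len(values)
    order = sorted(range(n), key=lambda i: -values[i])
    v = [values[i] for i in order] + [0.0]  # descending, padded with 0
    phi = [0.0] * n
    suffix = 0.0
    for j in range(n, 0, -1):                # j = 1-based sorted position
        suffix += (v[j - 1] - v[j]) / j
        phi[order[j - 1]] = suffix
    return phi

# Three retrieved documents with relevance 0.9, 0.5, 0.2 for one facet:
print(max_shapley([0.9, 0.5, 0.2]))  # ~[0.617, 0.217, 0.067]; sums to 0.9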

SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
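To illustrate the parameterized-problem-with-executable-ground-truth design, here is a hypothetical example in the same spirit (not drawn from the benchmark):

import math

def projectile_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Ground-truth solver for a parameterized kinematics question:
    'A projectile is launched at speed v0 and angle theta on level ground;
    find its horizontal range.' Any (v0, theta) instantiation gives a fresh
    numerical answer, so accuracy and consistency can be measured across variants."""
    theta = math.radians(theta_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

print(projectile_range(20.0, 45.0))  # ≈ 40.8 m
print(projectile_range(15.0, 30.0))  # ≈ 19.9 m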

Trusted AI Agents in the Cloud

AI agents powered by large language models are increasingly deployed as cloud services that autonomously access sensitive data, invoke external tools, and interact with other agents. However, these agents run within a complex multi-party ecosystem, where untrusted components can lead to data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per binary protection and offer no guarantees for cross-principal trust, accelerator-level isolation, or supervised agent behavior. We present Omega, a system that enables trusted AI agents by enforcing end-to-end isolation, establishing verifiable trust across all contributing principals, and supervising every external interaction with accountable provenance. Omega builds on Confidential VMs and Confidential GPUs to create a Trusted Agent Platform that hosts many agents within a single CVM using nested isolation. It also provides efficient multi-agent orchestration with cross-principal trust establishment via differential attestation, and a policy specification and enforcement framework that governs data access, tool usage, and inter-agent communication for data protection and regulatory compliance. Implemented on AMD SEV-SNP and NVIDIA H100, Omega fully secures agent state across CVM-GPU, and achieves high performance while enabling high-density, policy-compliant multi-agent deployments at cloud scale.

Impugan: Learning Conditional Generative Models for Robust Data Imputation

Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82\% lower Earth Mover's Distance (EMD) and 70\% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: github.com/zalishmahmud/impuganBigData2025
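A minimal PyTorch sketch of the conditional-GAN imputation setup described above, assuming tabular data with a binary observation mask; the layer sizes and mask conditioning are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Reconstructs missing entries from the observed features plus the mask.
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x, mask):
        # mask is 1 where a value is observed, 0 where it is missing.
        filled = self.net(torch.cat([x * mask, mask], dim=1))
        return x * mask + filled * (1 - mask)  # keep observed values, impute the rest

class Discriminator(nn.Module):
    # Scores whether a (possibly imputed) row looks like a real complete sample.
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)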

Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/qtrl/.

AI Models

FutureMa/Qwen3-8B-Drama-Thinking


license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
  • qwen3
  • thinking
  • creative-writing
  • screenwriting
  • drama
  • chain-of-thought
  • reasoning
  • ms-swift
  • full-parameter-finetuning
datasets:
  • custom-drama-thinking-dataset
language:
  • en
  • zh
library_name: transformers
pipeline_tag: text-generation
model-index:
  • name: Qwen3-8B-Drama-Thinking
    results:
      • task: text-generation (Creative Script Writing)
        metrics:
          • Thinking Depth Score: 9.0
          • Script Format Score: 9.0
          • Dramatic Craft Score: 8.5

Qwen3-8B-Drama-Thinking

This model is a full parameter fine-tuned version of Qwen/Qwen3-8B on a custom drama thinking dataset with explicit creative reasoning chains.

Model Description

  • Base Model: Qwen3-8B (8 billion parameters)
  • Training Method: Full Parameter Fine-tuning (NOT LoRA)
  • Training Framework: ms-swift
  • Training Data: Custom Drama Thinking Dataset (6,319 samples, avg ~5,000 tokens)
  • Specialization: Screenwriting with explicit <think>...</think> creative reasoning
  • Hardware: 2x NVIDIA H100 80GB SXM5
  • Training Time: 2 hours 46 minutes (3 epochs)
  • Training Cost: ~$17.86

Key Features

🎬 Professional Screenwriting Assistant

This model generates dramatic scripts with explicit creative deliberation:

  • Thinking Process Visible: Uses <think>...</think> tags to show internal reasoning
  • Deep Character Psychology: Analyzes motivations, defense mechanisms, subtext
  • Structural Planning: Three-act structure, emotional arcs, pacing decisions
  • Visual Storytelling: Symbolism, atmosphere, cinematographic choices
  • Professional Format: Correct screenplay formatting (scene headers, action lines, dialogue)

📊 Performance Comparison

Compared to base Qwen3-8B:

| Metric | Base Model | Fine-Tuned | Improvement |
|--------|------------|------------|-------------|
| Output Length | 1,071 tokens | 3,874 tokens | +262% |
| Thinking Depth | 5/10 | 9/10 | +80% |
| Creative Reasoning | 500 tokens | 3,400 tokens | +580% |
| Craft Analysis | Generic | Professional | Qualitative leap |

🎯 Unique Value Proposition

This is not just a text generator - it's a creative thinking partner that externalizes the entire screenwriting process: from title analysis to character psychology to structural planning to final execution.

Training Details

Training Configuration

Model:              Qwen/Qwen3-8B
Template:           qwen3_thinking
Training Type:      Full Parameter (all 8B parameters)
Max Length:         8192 tokens (for long thinking chains)
Batch Size:         1 per device × 2 GPUs
Gradient Accum:     8 steps (effective batch size: 16)
Learning Rate:      1e-5
Epochs:             3
Optimization:       DeepSpeed Zero3 + Gradient Checkpointing
                    Liger Kernel, BF16 mixed precision
Loss Scale:         ignore_empty_think
GPU Memory:         ~74.62 GB per H100 (stable)

Dataset Characteristics

  • Samples: 6,319 dramatic script continuations
  • Average Length: ~5,000 tokens per sample
  • Max Length: ~6,100 tokens
  • Format: Conversations with <think>...</think> reasoning tags (see the sample sketch after this list)
  • Content:
    • Script opening scenes (title, description, initial dialogue)
    • Extensive creative deliberation (3,000+ tokens of thinking)
    • Script continuation with proper formatting
  • Style: Dramatic, emotionally intense scenarios (conflicts, reconciliation, tragedy)
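A minimal sketch of what a single training sample in this format might look like; the field names and content are illustrative, not the actual dataset schema.

sample = {
    "messages": [
        {"role": "user", "content": "Title: The Reunion\n\nINT. FAMILY LIVING ROOM - DAY\n\nSARAH (35) stands by the window..."},
        {"role": "assistant", "content": (
            "<think>\nThe title points at estrangement, so the first beat has to "
            "establish distance before anyone speaks...\n</think>\n\n"
            "SARAH\n(not turning around)\nYou kept the garden.\n..."
        )},
    ]
}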

Training Metrics

  • Final Loss: 0.844
  • Average Loss: 0.978
  • Loss Trajectory: 1.602 (start) → 0.82-0.83 (end)
  • Training Speed: ~8 seconds/iteration
  • Total Steps: 1,185
  • Checkpoints: 5 saved (400, 800, 900, 1000, 1185)

Usage

Quick Start (ms-swift)

# Install ms-swift
pip install ms-swift

# Inference (interactive mode)
swift infer \
    --ckpt_dir FutureMa/Qwen3-8B-Drama-Thinking \
    --template qwen3_thinking \
    --max_new_tokens 4096 \
    --temperature 0.7

Python API

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import PtEngine, InferRequest, RequestConfig

# Initialize engine
engine = PtEngine(
    model_id_or_path="FutureMa/Qwen3-8B-Drama-Thinking",
    max_batch_size=1,
    torch_dtype="bfloat16"
)

# Create prompt
prompt = """Title: The Reunion
Description: Two estranged siblings meet at their childhood home after 20 years.

INT. FAMILY LIVING ROOM - DAY

SARAH (35) stands by the window, looking out at the garden.
MICHAEL (38) enters, hesitant."""

# Generate (streaming)
messages = [{'role': 'user', 'content': prompt}]
request = InferRequest(messages=messages)
config = RequestConfig(max_tokens=4096, temperature=0.7, stream=True)

for response in engine.infer([request], config)[0]:
    if response:
        print(response.choices[0].delta.content, end='', flush=True)

Using with Transformers (requires adapting checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "FutureMa/Qwen3-8B-Drama-Thinking",
    torch_dtype="bfloat16",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "system", "content": "You are a creative screenwriter assistant specialized in dramatic storytelling with internal reasoning."},
    {"role": "user", "content": "Write a scene about forgiveness..."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Output

Input Prompt:

Title: The Last Apology
Description: A daughter arrives at her estranged father's deathbed...

Model Output (excerpt):

<think>
Okay. Let's go. I'm sitting here with the opening of *The Last Apology* fresh in my mind,
and I need to figure out exactly where this goes next—not just the next scene, but the next
emotional beat, the next structural pivot...

First, the title: *The Last Apology*. That's not just poetic—it's structural. It tells me
this story is about delayed recognition, about the finality of words left unsaid...

[3,400 tokens of deep creative analysis including:]
- Title deconstruction and thematic implications
- Character psychology analysis
- Three-act structure planning
- Visual language and symbolism
- Multiple narrative paths considered
- Professional screenwriting techniques
</think>

INT. HOSPITAL ROOM - NIGHT

ANNA (28), in a wrinkled business suit, hesitates at the doorway.

DAVID (65) lies in bed, breathing labored...

[Script continues with proper formatting]

Intended Use

✅ Recommended Use Cases

  1. Screenwriting Education: Learn professional creative thinking process
  2. Script Ideation: Generate story frameworks and narrative alternatives
  3. Story Consulting: Explore "what if" scenarios with explicit reasoning
  4. Creative Brainstorming: Understand decision-making in storytelling
  5. Draft Development: Plan structure before execution

❌ Not Recommended For

  1. Final Shooting Scripts: Requires human refinement for production
  2. Comedy/Action Genres: Training bias toward dramatic content
  3. Long-form Series: Single-pass generation may lack consistency
  4. Immediate Production: Dialogue needs naturalization

Evaluation Results

Quantitative Metrics (vs. Base Model)

| Aspect | Score | Base Model | Improvement |
|--------|-------|------------|-------------|
| Thinking Depth | 9/10 | 5/10 | +80% |
| Script Format | 9/10 | 8/10 | +13% |
| Dramatic Craft | 8.5/10 | 8/10 | +6% |
| Character Psychology | 9/10 | 6/10 | +50% |
| Decision Transparency | 9/10 | 5/10 | +80% |
| Overall | 8.1/10 | 6.9/10 | +17% |

Qualitative Improvements

  • Professional Voice: Sounds like an experienced screenwriter
  • Structural Thinking: Explicit three-act planning
  • Meta-Awareness: "This isn't just a script. It's a reckoning."
  • Non-Linear Reasoning: Considers alternatives, backtracks, refines
  • Craft-Oriented: Explains why choices serve the story

Limitations

  1. Thinking Verbosity: Generates ~3,400 tokens of thinking (87% of output)

    • May be excessive for quick tasks
    • Consider using max_new_tokens to limit length
  2. Incomplete Execution: Token budget consumed by thinking

    • Many planned scenes not fully generated
    • May need 6,000-8,000 token limit for complete scripts
  3. Dialogue Naturalness: More direct/literary than conversational

    • Training data style influences output
    • May need post-processing for natural speech
  4. Training Data Bias: Skews toward melodramatic scenarios

    • Less suited for subtle/realistic dialogue
    • Best for emotionally intense stories

Training Insights

What Made This Successful

  1. 8192 Token Context: Essential for capturing full thinking chains

    • Initial assumption of 2048 would have truncated data
    • Average sample length: ~5,000 tokens
  2. DeepSpeed Zero3: Required (not optional)

    • Single H100: Would need ~109-114 GB (OOM)
    • Zero3 sharding: ~74.62 GB per card ✅
  3. Full Parameter Training: Worth the cost

    • Deeper capability transfer than LoRA
    • Better thinking process internalization
    • Cost: $17.86 (2.8 hours) vs ~$5 for LoRA
  4. Quality Training Data: 6,319 long-form reasoning examples

    • Actual creative process in <think> tags
    • High-quality dramatic writing

Citation

@misc{qwen3-drama-thinking-2025,
  author = {FutureMa},
  title = {Qwen3-8B-Drama-Thinking: Full Parameter Fine-tuning for Creative Screenwriting},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen3-8B-Drama-Thinking}},
  note = {Full parameter fine-tuning on 6,319 drama samples with explicit reasoning chains}
}

Acknowledgments

  • Base Model: Qwen Team - Qwen3-8B
  • Training Framework: ms-swift - ModelScope SWIFT
  • Infrastructure: Lambda Cloud - 2x H100 80GB SXM5
  • Dataset: Custom Drama Thinking Dataset (6,319 samples)

Model Card Contact

For questions or feedback:

  • HuggingFace: @FutureMa
  • GitHub Issues: Report via ms-swift repository

Training Date: 2025-12-08 Training Duration: 2h 46m Model Size: ~16GB (BF16 precision) Recommended VRAM: 16GB+ for inference

Author: FutureMa

Likes: 12

Downloads: 0

Tags: transformers, safetensors, qwen3, text-generation, thinking, creative-writing, screenwriting, drama, chain-of-thought, reasoning, ms-swift, full-parameter-finetuning, conversational, en, zh, dataset:custom-drama-thinking-dataset, base_model:Qwen/Qwen3-8B, base_model:finetune:Qwen/Qwen3-8B, license:apache-2.0, model-index, text-generation-inference, endpoints_compatible, region:us

mradermacher/GLM-4.6V-Flash-GGUF


base_model: zai-org/GLM-4.6V-Flash
language:
  • zh
  • en
library_name: transformers
license: mit
mradermacher:
  readme_rev: 1
quantized_by: mradermacher

About


static quants of https://huggingface.co/zai-org/GLM-4.6V-Flash


For a convenient overview and download list, visit our model page for this model.

weighted/imatrix quants are available at https://huggingface.co/mradermacher/GLM-4.6V-Flash-i1-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.
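If you prefer a Python route, here is a hedged sketch using llama-cpp-python (not mentioned in this card); the filename is hypothetical (substitute whichever quant you download from the table below), and support for this model architecture depends on your llama.cpp build.

from llama_cpp import Llama

# Hypothetical local filename for one of the quants listed below.
llm = Llama(model_path="GLM-4.6V-Flash.Q4_K_M.gguf", n_ctx=4096)
out = llm("Summarize retrieval-augmented generation in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])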

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 4.1 | |
| GGUF | Q3_K_S | 4.7 | |
| GGUF | Q3_K_M | 5.1 | lower quality |
| GGUF | Q3_K_L | 5.3 | |
| GGUF | IQ4_XS | 5.4 | |
| GGUF | Q4_K_S | 5.9 | fast, recommended |
| GGUF | Q4_K_M | 6.3 | fast, recommended |
| GGUF | Q5_K_S | 6.8 | |
| GGUF | Q5_K_M | 7.2 | |
| GGUF | Q6_K | 8.4 | very good quality |
| GGUF | Q8_0 | 10.1 | fast, best quality |
| GGUF | f16 | 18.9 | 16 bpw, overkill |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time.


Author: mradermacher

Likes: 11

Downloads: 0

Tags: transformers, gguf, zh, en, base_model:zai-org/GLM-4.6V-Flash, base_model:quantized:zai-org/GLM-4.6V-Flash, license:mit, endpoints_compatible, region:us, conversational

zai-org/AutoGLM-Phone-9B


license: mit
language:
  • zh
base_model:
  • zai-org/GLM-4.1V-9B-Base
pipeline_tag: image-text-to-text
tags:
  • agent

AutoGLM-Phone-9B

👋 Join our WeChat community: https://raw.githubusercontent.com/zai-org/Open-AutoGLM/refs/heads/main/resources/WECHAT.md

⚠️ This project is intended for research and educational purposes only.
Any use for illegal data access, system interference, or unlawful activities is strictly prohibited.
Please review our Terms of Use carefully.

Project Overview

Phone Agent is a mobile intelligent assistant framework built on AutoGLM, capable of understanding smartphone screens through multimodal perception and executing automated operations to complete tasks.
The system controls devices via ADB (Android Debug Bridge), uses a vision-language model for screen understanding, and leverages intelligent planning to generate and execute action sequences.

Users can simply describe tasks in natural language—for example, “Open Xiaohongshu and search for food recommendations.”
Phone Agent will automatically parse the intent, understand the current UI, plan the next steps, and carry out the entire workflow.

The system also includes:

  • Sensitive action confirmation mechanisms
  • Human-in-the-loop fallback for login or verification code scenarios
  • Remote ADB debugging, allowing device connection via WiFi or network for flexible remote control and development
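To make the perceive-plan-act loop concrete, here is a hedged Python sketch; the ADB commands are standard, while plan_next_action stands in for the AutoGLM model call and is hypothetical (see the GitHub guide for the actual deployment API).

import subprocess

def screenshot_png() -> bytes:
    # Capture the current screen as PNG bytes over ADB.
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout

def tap(x: int, y: int) -> None:
    # Send a tap at pixel coordinates (x, y).
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def run_task(task: str, plan_next_action, max_steps: int = 20) -> None:
    # plan_next_action(task, screenshot) -> {"type": "tap"/"finish", ...} is hypothetical.
    for _ in range(max_steps):
        action = plan_next_action(task, screenshot_png())
        if action["type"] == "finish":
            break
        if action["type"] == "tap":
            tap(action["x"], action["y"])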

Model Usage

We provide an open-source model usage guide to help you quickly download and deploy the model.
Please visit our GitHub for detailed instructions.

  • The model architecture is identical to GLM-4.1V-9B-Thinking.
    For deployment details, see the GLM-V repository.

Citation

If you find our work helpful, please cite the following paper:

@article{liu2024autoglm,
  title={Autoglm: Autonomous foundation agents for guis},
  author={Liu, Xiao and Qin, Bo and Liang, Dongzhu and Dong, Guang and Lai, Hanyu and Zhang, Hanchen and Zhao, Hanlin and Iong, Iat Long and Sun, Jiadai and Wang, Jiaqi and others},
  journal={arXiv preprint arXiv:2411.00820},
  year={2024}
}

Author: zai-org

Likes: 9

Downloads: 0

Tags: safetensors, glm4v, agent, image-text-to-text, conversational, zh, arxiv:2411.00820, base_model:zai-org/GLM-4.1V-9B-Base, base_model:finetune:zai-org/GLM-4.1V-9B-Base, license:mit, region:us

turboderp/GLM-4.6V-exl3


license: mit
base_model: zai-org/GLM-4.6V
base_model_relation: quantized
quantized_by: turboderp
tags:
  • exl3

EXL3 quants of GLM-4.6V

⚠️ Requires ExLlamaV3 v0.0.18 (or v0.0.17 dev branch)

Base bitrates:

4.00 bits per weight
(more to come)

Author: turboderp

Likes: 3

Downloads: 0

Tags: exl3, base_model:zai-org/GLM-4.6V, base_model:quantized:zai-org/GLM-4.6V, license:mit, region:us

embedl/Llama-3.2-3B-Instruct-FlashHead


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-3B-Instruct
tags:
  • text-generation-inference

Llama-3.2-3B-Instruct-FlashHead


Optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Custom vLLM generation via embedl-models

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |

FlashHead improves end-to-end speed by 1.26× over state-of-the-art, while maintaining full accuracy parity.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|--------|----------|--------|-----|------------|-------|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
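If the 131072-token context above does not fit on your GPU, a reduced configuration per the note in this section might look as follows; whether the embedl wrapper forwards gpu_memory_utilization to vLLM is an assumption here.

from embedl.models.vllm import LLM

llm = LLM(
    model="embedl/Llama-3.2-3B-Instruct-FlashHead",
    trust_remote_code=True,
    max_model_len=8192,            # shorter context, much smaller KV cache
    gpu_memory_utilization=0.90,   # assumed to be passed through to vLLM
)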

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Huggingface transformers generation
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-3B-Instruct, base_model:finetune:meta-llama/Llama-3.2-3B-Instruct, license:other, region:us

embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-3B-Instruct
tags:
  • text-generation-inference

Llama-3.2-3B-Instruct-FlashHead-W4A16


Optimized version of Llama-3.2-3B-Instruct using Quantization and FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint and latency.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |

FlashHead improves end-to-end speed by 1.26× over state-of-the-art, while maintaining full accuracy parity.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|--------|----------|--------|-----|------------|-------|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Huggingface transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-3B-Instruct, base_model:quantized:meta-llama/Llama-3.2-3B-Instruct, license:other, compressed-tensors, region:us

embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-1B-Instruct
tags:
  • text-generation-inference

Llama-3.2-1B-Instruct-FlashHead-W4A16


Optimized version of Llama-3.2-1B-Instruct using Quantization and FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint and latency.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

FlashHead improves end-to-end speed by 1.75× over state-of-the-art, while maintaining full accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

NVIDIA H200 measurement: FP8, 512 Tokens/sec.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|--------|----------|-----------|--------|-------|-----|------------|-------|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Huggingface transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-1B-Instruct, base_model:quantized:meta-llama/Llama-3.2-1B-Instruct, license:other, compressed-tensors, region:us

embedl/Llama-3.2-1B-Instruct-FlashHead


license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  • meta-llama/Llama-3.2-1B-Instruct
tags:
  • text-generation-inference

Llama-3.2-1B-Instruct-FlashHead


Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.


Model Details

| Field | Value |
|-------|-------|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |


Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|-----------|------------|-----------------|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

FlashHead improves end-to-end speed by 1.75× over state-of-the-art, while maintaining full accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

NVIDIA H200 measurement: FP8, 512 Tokens/sec.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|--------|----------|-----------|--------|-------|-----|------------|-------|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead closely matches baseline accuracy.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and quantized model runtime.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Huggingface transformers generation
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

Author: embedl

Likes: 3

Downloads: 0

Tags: safetensors, llama, text-generation-inference, base_model:meta-llama/Llama-3.2-1B-Instruct, base_model:finetune:meta-llama/Llama-3.2-1B-Instruct, license:other, region:us

QuixiAI/INTELLECT-3V

INTELLECT-3-V

A vision-language model created by grafting the language model weights from INTELLECT-3 into the GLM-4.6V architecture.

Motivation

INTELLECT-3 is a strong open-source language model, but lacks vision capabilities. GLM-4.6V is a vision-language model with an identical language model architecture. By replacing GLM-4.6V's language model weights with INTELLECT-3's weights while preserving the vision encoder and projection layers, we create a vision-language model powered by INTELLECT-3.

Architecture

Both models share the same language model backbone:

  • 46 transformer layers (layer 0 is dense MLP, layers 1-45 are MoE)
  • 4096 hidden dimension
  • 128 routed experts + shared experts per MoE layer
  • Grouped Query Attention (12288 q_proj, 1024 k/v_proj)
  • 151552 vocabulary size
  • BF16 weights

GLM-4.6V additionally includes:

  • 24-layer vision transformer (1536 hidden dim)
  • Visual merger projecting vision features to LLM hidden dimension
  • Downsampling convolution for spatial compression
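
Because the graft only works if the two language-model backbones are tensor-for-tensor identical, a quick sanity check is to compare the shapes of corresponding keys across the two checkpoints. A minimal sketch, assuming locally downloaded sharded safetensors checkpoints at the paths used by the creation script further down; the prefix handling mirrors the key listings under Source Model Architectures:

import glob, os
from safetensors import safe_open

def shard_shapes(model_dir, strip_prefix):
    # Collect {key: shape} across all *.safetensors shards, stripping the given
    # prefix so the two naming schemes line up ("model." vs "model.language_model.").
    shapes = {}
    for shard in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                shapes[key.replace(strip_prefix, "", 1)] = tuple(f.get_slice(key).get_shape())
    return shapes

if __name__ == "__main__":
    intellect = shard_shapes(os.path.expanduser("~/models/INTELLECT-3"), "model.")
    glm = shard_shapes(os.path.expanduser("~/models/GLM-4.6V"), "model.language_model.")

    # Only the grafted backbone tensors need to match: the 46 layers and the final norm.
    backbone = [k for k in intellect if k.startswith("layers.") or k == "norm.weight"]
    mismatches = [k for k in backbone if glm.get(k) != intellect[k]]
    print(f"checked {len(backbone)} backbone tensors, {len(mismatches)} shape mismatches")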

What Was Grafted

The following weights were copied from INTELLECT-3 to GLM-4.6V:

| INTELLECT-3       | GLM-4.6V                         |
|-------------------|----------------------------------|
| model.layers.*    | model.language_model.layers.*    |
| model.norm.weight | model.language_model.norm.weight |

What Was Preserved (from GLM-4.6V)

  • model.language_model.embed_tokens.weight — kept to maintain vision token compatibility
  • lm_head.weight — kept aligned with embed_tokens
  • model.visual.* — entire vision encoder and merger preserved
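
Put together, the graft copies every model.layers.* tensor plus model.norm.weight from INTELLECT-3 into the matching model.language_model.* slot and leaves embed_tokens, lm_head, and model.visual.* untouched. A minimal sketch of that key remapping over plain state dicts (not the released graft_intellect3_to_glm.py, which may instead work shard by shard):

def remap_key(intellect_key):
    # INTELLECT-3 -> GLM-4.6V key mapping for the grafted tensors;
    # returns None for tensors that are intentionally NOT copied
    # (embed_tokens, lm_head, and everything under model.visual.*).
    if intellect_key.startswith("model.layers."):
        return intellect_key.replace("model.layers.", "model.language_model.layers.", 1)
    if intellect_key == "model.norm.weight":
        return "model.language_model.norm.weight"
    return None

def graft(glm_state, intellect_state):
    # Overwrite GLM-4.6V backbone weights in place; the vision tower, embeddings,
    # and lm_head keep their original GLM-4.6V values.
    copied = 0
    for src_key, tensor in intellect_state.items():
        dst_key = remap_key(src_key)
        if dst_key is not None:
            assert glm_state[dst_key].shape == tensor.shape, dst_key
            glm_state[dst_key] = tensor
            copied += 1
    return copied

# Usage idea: graft(glm_state_dict, intellect_state_dict), then save the
# patched GLM-4.6V checkpoint as INTELLECT-3-V.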

Rationale

Why replace the final norm? The RMSNorm after the last transformer layer is tightly coupled to the layer outputs it normalizes: INTELLECT-3's norm was trained end-to-end with its layers and learned to normalize their specific output distribution, so it is grafted along with them.

Why keep embed_tokens? The vision merger projects visual features into the same embedding space as text tokens. Replacing embed_tokens could break the alignment between text and vision embeddings. Additionally, lm_head is often tied or co-trained with embed_tokens.

Why not replace lm_head? Same reasoning — keeping lm_head and embed_tokens together maintains their learned relationship.

Known Limitations

  1. Embedding space mismatch: INTELLECT-3's layers learned representations in a potentially different embedding space than GLM-4.6V. This may cause some degradation in both language and vision-language performance.

  2. Vision-language alignment: The visual merger was trained to project into GLM-4.6V's representation space. INTELLECT-3 may have learned different internal representations, potentially affecting vision-language tasks.

  3. Tokenizer compatibility: While both models have the same vocabulary size (151552), verify tokenizer compatibility for your use case.
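
For limitation 3, a quick check is to load both source tokenizers and compare their vocabularies and the token ids they produce for a sample string. A minimal sketch, assuming local copies at the same paths used by the creation script below:

import os
from transformers import AutoTokenizer

# Local paths match the creation-script example below; substitute your own copies.
tok_intellect = AutoTokenizer.from_pretrained(os.path.expanduser("~/models/INTELLECT-3"), trust_remote_code=True)
tok_glm = AutoTokenizer.from_pretrained(os.path.expanduser("~/models/GLM-4.6V"), trust_remote_code=True)

print("same vocab size :", tok_intellect.vocab_size == tok_glm.vocab_size)
print("identical vocab :", tok_intellect.get_vocab() == tok_glm.get_vocab())

sample = "Describe this image in one sentence."
print("same token ids  :", tok_intellect.encode(sample) == tok_glm.encode(sample))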

Creation Script

The model was created using graft_intellect3_to_glm.py:

python graft_intellect3_to_glm.py \
    --intellect3 ~/models/INTELLECT-3 \
    --glm ~/models/GLM-4.6V \
    --output ~/models/INTELLECT-3-V

Source Model Architectures

INTELLECT-3

lm_head.weight,[151552,4096],BF16
model.embed_tokens.weight,[151552,4096],BF16
model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.norm.weight,[4096],BF16

GLM-4.6V

lm_head.weight,[151552,4096],BF16
model.language_model.embed_tokens.weight,[151552,4096],BF16
model.language_model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.language_model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.language_model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.language_model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.language_model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.language_model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.language_model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.language_model.norm.weight,[4096],BF16
model.visual.blocks.[0-23].attn.proj.weight,[1536,1536],BF16
model.visual.blocks.[0-23].attn.qkv.weight,[4608,1536],BF16
model.visual.blocks.[0-23].mlp.down_proj.weight,[1536,4096],BF16
model.visual.blocks.[0-23].mlp.gate_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].mlp.up_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].norm[1-2].weight,[1536],BF16
model.visual.downsample.bias,[4096],BF16
model.visual.downsample.weight,[4096,1536,2,2],BF16
model.visual.embeddings.position_embedding.weight,[576,1536],BF16
model.visual.merger.down_proj.weight,[4096,10944],BF16
model.visual.merger.gate_proj.weight,[10944,4096],BF16
model.visual.merger.post_projection_norm.bias,[4096],BF16
model.visual.merger.post_projection_norm.weight,[4096],BF16
model.visual.merger.proj.weight,[4096,4096],BF16
model.visual.merger.up_proj.weight,[10944,4096],BF16
model.visual.patch_embed.proj.bias,[1536],BF16
model.visual.patch_embed.proj.weight,[1536,3,2,14,14],BF16
model.visual.post_conv_layernorm.weight,[1536],BF16
model.visual.post_layernorm.weight,[1536],BF16

License

Please refer to the licenses of the source models:

  • INTELLECT-3
  • zai-org/GLM-4.6V

Acknowledgments

Author: QuixiAI

Likes: 2

Downloads: 0

Tags: safetensors, glm4v_moe, region:us

AliceThirty/GLM-4.6V-gguf



This is an experiment: llama.cpp does not yet support GLM-4.5V or GLM-4.6V, so I made llama.cpp treat this model as the GLM-4.5-Air architecture (which means this GGUF can only process text). It seems to have worked.
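
The card does not describe how the converter was convinced; one plausible approach (an assumption, not the author's documented procedure) is to rewrite config.json so that llama.cpp's convert_hf_to_gguf.py sees a text-only GLM-4.5-Air-style model. The architecture and model_type strings below are placeholders and must match whatever llama.cpp actually recognizes for GLM-4.5-Air; depending on the converter, the model.visual.* tensors may also need to be skipped.

import json, shutil

SRC = "GLM-4.6V"          # local checkout of zai-org/GLM-4.6V
DST = "GLM-4.6V-as-air"   # patched copy for the converter to read

shutil.copytree(SRC, DST)
with open(f"{DST}/config.json") as f:
    cfg = json.load(f)

# Pretend to be the text-only GLM-4.5-Air family. These strings are assumptions;
# use whatever llama.cpp's converter actually expects.
cfg["architectures"] = ["Glm4MoeForCausalLM"]
cfg["model_type"] = "glm4_moe"
cfg.pop("vision_config", None)  # drop the vision config -> text-only GGUF

with open(f"{DST}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Then convert as usual, e.g.:
#   python convert_hf_to_gguf.py GLM-4.6V-as-air --outfile GLM-4.6V-text.gguf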

Author: AliceThirty

Likes: 2

Downloads: 0

Tags: gguf, base_model:zai-org/GLM-4.6V, base_model:quantized:zai-org/GLM-4.6V, endpoints_compatible, region:us, conversational