Today's AI Summary

AI Model Highlights: Ring-flash-2.0 Excels in Reasoning, Orion Automates Fuzzing

Today's AI landscape is marked by advancements in reasoning capabilities, efficiency, and automation. Here's a look at the most interesting developments:

Noteworthy Research Papers

  • Generalizable Geometric Image Caption Synthesis: A paper introduces a Reinforcement Learning with Verifiable Rewards (RLVR) process to refine captions for geometric images, improving the reasoning capabilities of multimodal large language models. The generated dataset enhances general reasoning capabilities, yielding accuracy improvements of 2.8%–4.8% in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with 2.4%–3.9% improvements in Art, Design, Tech, and Engineering tasks in MMMU.
  • FlowRL: Matching Reward Distributions for LLM Reasoning: This paper proposes FlowRL, a method that matches the full reward distribution via flow balancing in LLM reinforcement learning, promoting diverse exploration and generalizable reasoning trajectories. FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks.
  • Orion: Fuzzing Workflow Automation: This paper introduces Orion, a framework that automates the manual bottlenecks of fuzzing by integrating LLM reasoning with traditional tools, reducing human effort by 46-204x and discovering two previously unknown vulnerabilities in the widely used open-source clib library.
  • Internalizing Self-Consistency in Language Models: This paper introduces Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA).
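For a concrete sense of the majority/minority signal MACA builds on, here is a minimal, hypothetical sketch of deriving preferred and dispreferred reasoning traces from sampled debate outcomes (illustrative only, not the paper's implementation):

from collections import Counter

def consensus_preferences(sampled):
    # sampled: list of (reasoning_trace, final_answer) pairs from debate agents.
    # Traces agreeing with the majority answer become "preferred"; the rest
    # become "dispreferred". Names and structure are illustrative assumptions.
    counts = Counter(answer for _, answer in sampled)
    majority_answer, _ = counts.most_common(1)[0]
    preferred = [t for t, a in sampled if a == majority_answer]
    dispreferred = [t for t, a in sampled if a != majority_answer]
    return preferred, dispreferred

# Example: three of four agents converge on "42", so their traces form the
# preferred set and the dissenting trace the dispreferred set.
prefs, rejects = consensus_preferences(
    [("trace A", "42"), ("trace B", "42"), ("trace C", "41"), ("trace D", "42")]
)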

Model Spotlight: Ring-flash-2.0

The most notable model release is inclusionAI/Ring-flash-2.0, boasting 31 likes. This model is a high-performance thinking model optimized from Ling-flash-2.0-base. Key features include:

  • Architecture: 100B parameters with only 6.1B activated per inference, using the icepop algorithm to address training instability in reinforcement learning (RL) for MoE LLMs.
  • Performance: Demonstrates breakthroughs in math competitions, code generation, and logical reasoning, surpassing SOTA dense models under 40B parameters.
  • Efficiency: Achieves a high generation speed of 200+ tokens/sec on four H20 GPUs.
  • Training: Utilizes a multi-stage training approach (SFT + RLVR + RLHF) to enhance capabilities.

Key Takeaways

  • Reasoning is a Key Focus: Several models and papers are focusing on improving the reasoning capabilities of LLMs, particularly in complex domains like mathematics and code.
  • Efficiency Matters: Ring-flash-2.0 highlights the importance of efficient architectures that can deliver high performance with fewer activated parameters.
  • Automation is Advancing: Orion demonstrates the potential of LLMs to automate complex workflows like fuzzing, significantly reducing human effort.
  • Self-Consistency: MACA highlights the importance of self-consistency in language models and introduces a novel approach to improve it.

AI Papers for 2026-03-24

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over- and under-scoring using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
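As a rough illustration of the shift from mask-level to pixel-grounded scoring, the sketch below computes pixel-level precision, recall, and F1 against a per-pixel tamper map; the function and threshold are assumptions for illustration, not the paper's actual metrics:

import numpy as np

def pixel_tamper_scores(pred_map, gt_tamper_map, thr=0.5):
    # Score a predicted tamper probability map against a binary per-pixel
    # ground-truth edit map (rather than a coarse object mask).
    pred = np.asarray(pred_map) >= thr
    gt = np.asarray(gt_tamper_map) >= thr
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1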

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves an absolute improvement of 10.2 points on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
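The think-act-observe loop described above can be pictured with a short, hypothetical sketch; the toolkit contents, the llm callable, and the action schema are all assumptions rather than VideoSeek's real interface:

def run_video_agent(question, toolkit, llm, max_steps=8):
    # toolkit: dict mapping tool names to callables (e.g. keyframe retrieval,
    # clip captioning); llm: callable that proposes the next action as a dict
    # like {"tool": ..., "args": {...}} or {"answer": ...}. Both are assumed.
    observations = []
    for _ in range(max_steps):
        action = llm(question=question, observations=observations)   # think
        if "answer" in action:                                        # stop once evidence suffices
            return action["answer"]
        result = toolkit[action["tool"]](**action["args"])            # act
        observations.append(result)                                   # observe
    return llm(question=question, observations=observations, final=True)["answer"]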

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim at transferring knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
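A generic sketch of the second (alignment) stage: an InfoNCE-style loss pulls each payload embedding toward its paired vulnerability-description embedding and away from the other descriptions in the batch. This is a standard contrastive objective shown for intuition, not necessarily the authors' exact loss:

import torch
import torch.nn.functional as F

def align_payloads_to_text(payload_emb, text_emb, temperature=0.07):
    # payload_emb, text_emb: (B, D) batches of paired embeddings, where the
    # text side comes from the description encoder trained in stage one.
    payload_emb = F.normalize(payload_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = payload_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(payload_emb.size(0), device=payload_emb.device)
    return F.cross_entropy(logits, targets)             # diagonal pairs are positives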

Adaptive Greedy Frame Selection for Long Video Understanding

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
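The greedy objective can be sketched in a few lines: at each step, pick the frame with the largest marginal gain of a weighted sum of query relevance and facility-location coverage. The score matrices, mixing weight alpha, and normalization below are assumptions for illustration; the paper's presets may weight things differently:

import numpy as np

def greedy_frame_selection(relevance, similarity, budget, alpha=0.5):
    # relevance: (N,) query-frame scores (e.g. from SigLIP); similarity: (N, N)
    # frame-frame similarities (e.g. from DINOv2). Greedily maximizes
    # alpha * relevance + (1 - alpha) * facility-location coverage gain.
    n = len(relevance)
    selected, covered = [], np.zeros(n)
    for _ in range(budget):
        best_i, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            coverage_gain = np.maximum(similarity[i], covered).sum() - covered.sum()
            gain = alpha * relevance[i] + (1 - alpha) * coverage_gain
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = np.maximum(similarity[best_i], covered)
    return selected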

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof-of-concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature-based knowledge retrieval and multi-agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
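The two statistics the paper leans on, inter-classifier agreement (Cohen's kappa) and paired disagreement significance (McNemar's test), can be sketched directly from binary faithful/unfaithful labels; this is a generic implementation for intuition, not the authors' evaluation code:

from math import comb

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two binary classifiers (1 = faithful).
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_o - p_e) / (1 - p_e)

def mcnemar_exact_p(labels_a, labels_b):
    # Exact two-sided McNemar test on discordant pairs (cases where the two
    # classifiers disagree on the same reasoning trace).
    b = sum(1 for x, y in zip(labels_a, labels_b) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(labels_a, labels_b) if x == 0 and y == 1)
    n, k = b + c, min(b, c)
    p = sum(comb(n, i) for i in range(k + 1)) / (2 ** n)
    return min(1.0, 2 * p)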

Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people's implicit, evolving beliefs shape what they seek and how they act under uncertainty -- especially in high-stakes settings such as disaster response, emergency medicine, and human-in-the-loop autonomy. Prior approaches either prompt LLMs directly or use latent-state models that treat beliefs as static and independent, often producing incoherent mental models over time and weak reasoning in dynamic contexts. We introduce a structured cognitive trajectory model for LLM-based ToM that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking belief evolution to information seeking and decisions. Our model contributes (i) a novel projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor graph representation of belief interdependencies, and (iii) an ELBO-based objective that captures belief accumulation and delayed decisions. Across multiple real-world disaster evacuation datasets, our model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning, providing a principled module for augmenting LLMs with ToM in high-uncertainty environments. https://anonymous.4open.science/r/ICML_submission-6373/

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a `human-like social critic.' CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
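A simplified sketch of the core idea: aggregate next-token probability mass over candidates whose embeddings are semantically close to the chosen token, yielding a confidence score from a single generation. The embedding lookup, similarity threshold, and cluster rule below are assumptions; the paper's clustering and prefix matching are more involved:

import numpy as np

def semantic_cluster_confidence(top_tokens, top_probs, embed, chosen_idx=0, tau=0.8):
    # top_tokens / top_probs: the model's top-k next-token candidates and their
    # probabilities; embed: callable returning the model's embedding for a token
    # (all names here are hypothetical).
    vecs = np.stack([embed(t) for t in top_tokens])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs[chosen_idx]              # cosine similarity to the chosen token
    in_cluster = sims >= tau                    # tokens treated as semantically equivalent
    return float(np.asarray(top_probs)[in_cluster].sum())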

AI Models

sarvamai/sarvam-30b-gguf


language:

  • en
  • hi
  • bn
  • ta
  • te
  • mr
  • gu
  • kn
  • ml
  • pa
  • or
  • as
  • ur
  • sa
  • ne
  • sd
  • kok
  • mai
  • doi
  • mni
  • sat
  • ks
  • bo

library_name: transformers
license: apache-2.0
pipeline_tag: text-generation


!!! This is the GGUF version of Sarvam-30B !!!

Download the original weights here!

Index

  1. Introduction
  2. Architecture
  3. Benchmarks
    • Knowledge & Coding
    • Reasoning & Math
    • Agentic
  4. Inference
  5. Footnote
  6. Citation

Introduction

Sarvam-30B is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-30B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 30B MoE model is designed for throughput and memory efficiency, achieved through fewer layers, grouped KV attention, and smaller experts. It uses 19 layers, a dense FFN intermediate_size of 8192, a moe_intermediate_size of 1024, top-6 routing, grouped KV heads (num_key_value_heads=4), and an extremely high rope_theta (8e6) for long-context stability without RoPE scaling. It has 128 experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
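For reference, the hyperparameters listed above can be collected into a config sketch; the field names below follow common Hugging Face MoE conventions and are assumptions that may not match the actual Sarvam-30B config.json:

# Hypothetical summary of the quoted hyperparameters (field names assumed,
# not copied from the released config.json).
sarvam_30b_config = {
    "num_hidden_layers": 19,
    "intermediate_size": 8192,        # dense FFN
    "moe_intermediate_size": 1024,    # per-expert FFN
    "num_experts": 128,
    "num_experts_per_tok": 6,         # top-6 routing
    "n_shared_experts": 1,
    "routed_scaling_factor": 2.5,
    "num_key_value_heads": 4,         # grouped KV attention
    "rope_theta": 8e6,                # long-context stability without RoPE scaling
}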

Benchmarks

<details> <summary>Knowledge & Coding</summary>

| Benchmark | Sarvam-30B | Gemma 27B It | Mistral-3.2-24B | OLMo 3.1 32B Think | Nemotron-3-Nano-30B-A3B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|---|---|
| Math500 | 97.0 | 87.4 | 69.4 | 96.2 | 98.0 | 97.6 | 97.0 | 94.2 |
| HumanEval | 92.1 | 88.4 | 92.9 | 95.1 | 97.6 | 95.7 | 96.3 | 95.7 |
| MBPP | 92.7 | 81.8 | 78.3 | 58.7 | 91.9 | 94.3 | 91.8 | 95.3 |
| Live Code Bench v6 | 70.0 | 28.0 | 26.0 | 73.0 | 68.3 | 66.0 | 64.0 | 61.0 |
| MMLU | 85.1 | 81.2 | 80.5 | 86.4 | 84.0 | 88.4 | 86.9 | 85.3 |
| MMLU Pro | 80.0 | 68.1 | 69.1 | 72.0 | 78.3 | 80.9 | 73.6 | 75.0 |
| MILU | 76.8 | 69.2 | 67.9 | 69.9 | 64.8 | 82.6 | 75.6 | 73.7 |
| Arena Hard v2 | 49.0 | 50.1 | 43.1 | 42.0 | 67.7 | 72.1 | 58.1 | 62.9 |
| Writing Bench | 78.7 | 71.4 | 70.3 | 75.7 | 83.7 | 85.0 | 79.2 | 79.1 |

</details> <details> <summary>Reasoning & Math</summary>

| Benchmark | Sarvam-30B | OLMo 3.1 32B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|
| GPQA Diamond | 66.5 | 57.5 | 73.0 | 73.4 | 75.2 | 71.5 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 78.1 (81.7) | 89.1 (99.2) | 85.0 (-) | 91.6 (-) | 91.7 (98.7) |
| HMMT (Feb 25) | 73.3 | 51.7 | 85.0 | 71.4 | 85.0 | 76.7 |
| HMMT (Nov 25) | 74.2 | 58.3 | 75.0 | 73.3 | 81.7 | 68.3 |
| Beyond AIME | 58.3 | 48.5 | 64.0 | 61.0 | 60.0 | 46.0 |

</details> <details> <summary>Agentic</summary>

| Benchmark | Sarvam-30B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|
| BrowseComp | 35.5 | 23.8 | 2.9 | 42.8 | 28.3 |
| SWE Bench Verified | 34.0 | 38.8 | 22.0 | 59.2 | 34.0 |
| τ² Bench (avg.) | 45.7 | 49.0 | 47.7 | 79.5 | 48.7 |

See footnote for evaluation details.

</details>

Inference

Clone and build

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j

Download the model (all shards)

huggingface-cli download sarvamai/sarvam-30b-gguf --local-dir sarvam-30b-gguf

Run interactive chat

./build/bin/llama-cli \
  -m sarvam-30b-gguf/sarvam-30b-Q4_K_M.gguf-00001-of-00006.gguf \
  -c 4096 \
  -n 512 \
  -p "You are a helpful assistant." \
  --conversation

OpenAI-compatible API server

./build/bin/llama-server \
  -m sarvam-30b-gguf/sarvam-30b-Q4_K_M.gguf-00001-of-00006.gguf \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080

Then query it:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.8,
    "max_tokens": 512
  }'

Footnote

  • General settings: All benchmarks are evaluated with a maximum context length of 65,536 tokens.
  • Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT, HumanEval, MBPP): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Writing Bench: Responses generated using official Writing-Bench parameters: temperature=0.7, top_p=0.8, top_k=20, max_length=16000. Scoring performed using the official Writing-Bench critic model with: temperature=1.0, top_p=0.95, max_length=2048.
  • Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): Evaluated with temperature=0.5, top_p=1.0, max_new_tokens=32768.

Citation

@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}

Author: sarvamai

Likes: 9

Downloads: 0

Tags: transformers, gguf, text-generation, en, hi, bn, ta, te, mr, gu, kn, ml, pa, or, as, ur, sa, ne, sd, kok, mai, doi, mni, sat, ks, bo, license:apache-2.0, endpoints_compatible, region:us, conversational

JANGQ-AI/Mistral-Small-4-119B-A6B-JANG_2L


language:

  • en
  • fr
  • de
  • es
  • ja
  • ko
  • zh

library_name: mlx
license: apache-2.0
base_model: mistralai/Mistral-Small-4-119B-2603

tags:
  • jang
  • quantized
  • mixed-precision
  • apple-silicon
  • mlx
  • vlm
  • reasoning
  • thinking
  • moe
  • mla

pipeline_tag: text-generation

<p align="center"> <a href="https://mlx.studio"><img src="https://raw.githubusercontent.com/jjang-ai/jangq/main/assets/mlx-studio-light.png" alt="MLX Studio" width="500"></a> </p> <h4 align="center"><a href="https://mlx.studio">MLX Studio</a> — the only app that natively supports JANG models with reasoning</h4>
<p align="center"> <img src="https://raw.githubusercontent.com/jjang-ai/jangq/main/assets/jangq-logo-dark.png" alt="JANG" width="300"> </p> <h3 align="center">Mistral Small 4 (119B-A6B) — JANG_2L (2.14-bit) — Reasoning + VLM</h3> <p align="center"><b>JANG</b> — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX</p> <p align="center"> <a href="https://github.com/jjang-ai/jangq"><img src="https://img.shields.io/badge/GitHub-Source_Code-blue?logo=github" alt="GitHub"></a>&nbsp; <a href="https://pypi.org/project/jang/"><img src="https://img.shields.io/pypi/v/jang?label=PyPI&color=green" alt="PyPI"></a>&nbsp; <a href="https://jangq.ai"><img src="https://img.shields.io/badge/Web-jangq.ai-orange" alt="Website"></a>&nbsp; <a href="https://x.com/dealignai"><img src="https://img.shields.io/badge/X-@dealignai-black?logo=x" alt="X/Twitter"></a> </p>

JANG is fully open-source. Quantization engine and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.


First Mistral Small 4 (119B) on Apple Silicon. MLA attention + 128 MoE experts + Pixtral VLM. 5x faster prefill than MLX Community 4-bit.

Reasoning mode: Set reasoning_effort to "high" for step-by-step reasoning with [THINK]...[/THINK] tags.


Speed Comparison — JANG vs MLX Community

| Model | Size | Gen tok/s | Prefill tok/s | RAM | Fits On |
|-------|:----:|:---------:|:-------------:|:---:|---------|
| JANG_2L (this model) | 30 GB | 82 | 216 | 40 GB | 48 GB Macs |
| JANG_4M | 57 GB | 80 | 202 | 68 GB | 96+ GB Macs |
| JANG_6M | 84 GB | 74 | 160 | 95 GB | 128+ GB Macs |
| MLX Community 4-bit | 63 GB | 84 | 43 | 68 GB | 96+ GB Macs |

  • 5x faster prefill (216 vs 43 tok/s)
  • Half the size (30 GB vs 63 GB) at comparable generation speed
  • Benchmarked on M3 Ultra 256 GB with bfloat16 compute

Key Features

  • 82 tok/s generation on M3 Ultra — matches MLX 4-bit at half the size
  • 30 GB on disk, 40 GB peak RAM — fits 48 GB Macs (M4 Pro, M2/M3 Max)
  • Vision (VLM): Pixtral encoder, 1540px max, processes images
  • Reasoning mode: [THINK]...[/THINK] step-by-step reasoning
  • Code generation: Complete functions with docstrings and optimized logic
  • Math: Step-by-step calculations with distributive property
  • 119B total / 6B active per token — MLA attention + 128 MoE experts

Architecture

JANG_2L Bit Allocation

| Tensor Type | Bits | Purpose |
|------------|:----:|---------|
| Attention (q/k/v/o projections) | 8 | Critical — preserves MLA precision |
| Embeddings, lm_head | 8 | Critical — token representation |
| MoE gate (router) | 16 | Float16 passthrough — routing precision |
| Shared experts | 6 | Important — always active |
| Routed experts (128) | 2 | Compressed — many experts = redundancy |
| Norms, biases | full | Float — tiny tensors, keep exact |
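The allocation above could be expressed as a simple name-to-bits rule; the weight-name patterns below are guesses at MLX naming conventions, and the real assignment logic lives in the jangq repository:

def jang_2l_bits(tensor_name):
    # Map a tensor name to its JANG_2L bit width (None = keep in float).
    # Name patterns are assumptions for illustration only.
    if "norm" in tensor_name or tensor_name.endswith(".bias"):
        return None                                   # tiny tensors, keep exact
    if ".gate." in tensor_name and "proj" not in tensor_name:
        return 16                                     # MoE router passthrough
    if any(p in tensor_name for p in ("q_proj", "k_proj", "v_proj", "o_proj",
                                      "embed_tokens", "lm_head")):
        return 8                                      # attention projections and embeddings
    if "shared_expert" in tensor_name:
        return 6                                      # always-active shared expert
    if "experts" in tensor_name:
        return 2                                      # routed experts hold most parameters
    return 4                                          # assumed fallback for anything unlisted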

Benchmarks

MMLU benchmarks in progress — will be updated with per-subject scores and MLX 4-bit comparison.

Requirements

  • MLX Studio for native JANG support with reasoning
  • Or: Apple Silicon Mac with 48+ GB unified memory

Install

pip install "jang[mlx]"

Created by Jinho Jang · jangq.ai · @dealignai

Author: JANGQ-AI

Likes: 5

Downloads: 9

Tags: mlx, safetensors, mistral3, jang, quantized, mixed-precision, apple-silicon, vlm, reasoning, thinking, moe, mla, text-generation, conversational, en, fr, de, es, ja, ko, zh, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, license:apache-2.0, fp8, region:us

Aratako/Irodori-TTS-500M-v2


license: mit
language:

  • ja

pipeline_tag: text-to-speech
tags:
  • speech
  • voice
  • tts

Irodori-TTS-500M-v2

Code WandB Demo Space

Irodori-TTS-500M-v2 is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. The architecture and training design largely follow Echo-TTS, using continuous latents as the generation target. It supports zero-shot voice cloning from reference audio.

A unique feature of this model is emoji-based style and sound effect control — by inserting specific emojis into the input text, you can control speaking styles, emotions, and even sound effects in the generated audio.

🌟 Key Features

  • Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  • Voice Cloning: Zero-shot voice cloning from a short reference audio clip.
  • Emoji-based Style Control: Control speaking styles, emotions, and sound effects by embedding emojis directly in the input text. See EMOJI_ANNOTATIONS.md for the full list of supported emojis and their effects.

✨ What's New in v2

This version brings several improvements over the original Irodori-TTS-500M:

  • Upgraded VAE: Switched the audio VAE to Aratako/Semantic-DACVAE-Japanese-32dim, enabling higher-quality Japanese speech generation.
  • Extended Training: The number of training steps has been increased by 2.5 times, resulting in better convergence, stability, and overall audio fidelity.
  • Data & Preprocessing Improvements: Implemented refined text preprocessing pipelines and stricter data filtering to enhance the model's robustness and output quality.

🏗️ Architecture

The model (approximately 500M parameters) consists of three main components:

  1. Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
  2. Reference Latent Encoder: Encodes patched reference audio latents for speaker/style conditioning via self-attention + SwiGLU layers.
  3. Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs.

Audio is represented as continuous latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec (32-dim), enabling high-quality 48kHz waveform reconstruction.
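As a rough picture of how a rectified-flow model like this turns noise into audio latents, here is a generic Euler-integration sampler; the velocity_model signature, conditioning arguments, and step count are assumptions, not Irodori-TTS's actual inference code (see the linked GitHub repository for that):

import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, text_cond, ref_cond, latent_shape, steps=32):
    # Integrate the learned velocity field from noise (t=0) toward data (t=1)
    # with plain Euler steps; the result is a sequence of continuous DACVAE
    # latents to be decoded into a waveform.
    x = torch.randn(latent_shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = velocity_model(x, t, text_cond, ref_cond)   # predicted dx/dt
        x = x + v * dt
    return x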

🎧 Audio Samples

1. Standard TTS

Basic Japanese text-to-speech generation (without reference audio).

| Case | Text | Generated Audio |
| :--- | :--- | :--- |
| Sample 1 | "お電話ありがとうございます。ただいま電話が大変混み合っております。恐れ入りますが、発信音のあとに、ご用件をお話しください。" | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/standard_sample1.wav"></audio> |
| Sample 2 | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/standard_sample2.wav"></audio> |

2. Emoji Annotation Control

Examples of controlling speaking style and effects with emojis. For the full list of supported emojis, see EMOJI_ANNOTATIONS.md.

| Case | Text (with Emoji) | Generated Audio |
| :--- | :--- | :--- |
| Sample 1 | なーに、どうしたの?…え?もっと近づいてほしい?…👂😮‍💨👂😮‍💨こういうのが好きなんだ? | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/emoji_sample1.wav"></audio> |
| Sample 2 | うぅ…😭そんなに酷いこと、言わないで…😭 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/emoji_sample2.wav"></audio> |
| Sample 3 | 🤧🤧ごめんね、風邪引いちゃってて🤧…大丈夫、ただの風邪だからすぐ治るよ🥺 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/emoji_sample3.wav"></audio> |

3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

| Case | Reference Audio | Generated Audio |
| :--- | :--- | :--- |
| Example 1 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/clone_ref1.wav"></audio> | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/clone_gen1.wav"></audio> |
| Example 2 | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/clone_ref2.wav"></audio> | <audio controls src="https://huggingface.co/Aratako/Irodori-TTS-500M-v2/resolve/main/samples/clone_gen2.wav"></audio> |

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub: Aratako/Irodori-TTS

📊 Training Data & Annotation

The model was trained on a high-quality Japanese speech dataset, refined with improved data filtering in v2. To enable the emoji-based style control, the training texts were enriched with emoji annotations. These annotations were automatically generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct.

⚠️ Limitations

  • Japanese Only: This model currently supports Japanese text input only.
  • Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices or speaking styles underrepresented in the training data.
  • Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.

📜 License & Ethical Restrictions

License

This model is released under MIT.

Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

  1. No Impersonation: Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
  2. No Misinformation: Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
  3. Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

🙏 Acknowledgments

This project builds upon the following works:

We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature.

🖊️ Citation

If you use Irodori-TTS-v2 in your research or project, please cite it as follows:

@misc{irodori-tts-v2,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}

Author: Aratako

Likes: 4

Downloads: 0

Tags: safetensors, speech, voice, tts, text-to-speech, ja, license:mit, region:us

Intel/Qwen3.5-35B-A3B-gguf-q2ks-mixed-AutoRound


base_model:

  • Qwen/Qwen3.5-35B-A3B

Model Details

This model is a mixed gguf q2ks format of Qwen/Qwen3.5-35B-A3B generated by intel/auto-round algorithm. Embedding layer and lm-head layer are fallback to 8 bits and non expert layers are fallback to 4 bits. Please refer to Section Generate the model for more details.

Please follow the license of the original model.

How To Use

Llama.cpp inference.

wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg
./llama-cli  -m Qwen3.5-35B-A3B-w2g16/Qwen3.5-35B-A3B-Q2_K_S.gguf  -mm Qwen3.5-35B-A3B-w2g16/mmproj-model.gguf
> /image demo.jpeg

Loaded media from 'demo.jpeg'

> Describe this image.

[Start thinking]

The user wants a description of the image.

1.  **Identify the main subjects:** A woman and a dog (looks like a yellow Labrador Retriever).
2.  **Identify the setting:** A sandy beach with the ocean in the background. The lighting suggests sunset or sunrise (golden hour).
3.  **Describe the action:** The woman is sitting on the sand. The dog is sitting next to her. The dog is reaching out its left paw and placing it on the woman's hand. The woman is smiling and looking at the dog.
4.  **Describe the details:**
    *   **The Woman:** Long dark hair, wearing a plaid shirt (blue/white/black), dark pants (maybe jeans or capris), barefoot. She has a white watch on her wrist.
    *   **The Dog:** Light tan/yellow fur, wearing a harness with a colorful pattern (blue with colorful dots/flowers). It has a leash attached.
    *   **The Background:** The ocean is calm with small waves breaking. The sky is bright, almost washed out, indicating strong backlighting from the sun.
5.  **Synthesize into a coherent description:** Start with the overall scene, then details about the subjects, their interaction, and the atmosphere.

*Drafting the description:*
The image shows a heartwarming scene on a sandy beach during what looks like late afternoon or sunset. A young woman with long dark hair is sitting on the sand, facing right. She is wearing a blue and white plaid shirt and dark pants. She is smiling warmly. Next to her is a yellow Labrador Retriever sitting on its hind legs. The dog is wearing a blue patterned harness. The dog is extending its left paw and resting it gently on the woman's hand, which she is holding up. The woman seems to be interacting with the dog, possibly giving it a treat or just playing a "shake" trick. The background features a calm ocean with gentle waves rolling in. The sun is bright in the upper right, creating a lens flare and casting a warm, golden glow over the scene. The sand is textured with footprints and paw prints.

*Refining:* Make it flow better. Mention the "handshake" interaction clearly. Note the leash on the sand.

*Final Polish:*
This image captures a sweet moment between a woman and her dog on a beach at sunset.
- **Subjects:** A woman with long dark hair, wearing a plaid shirt and dark pants, sits on the sand. Beside her is a yellow Labrador Retriever wearing a patterned harness.
- **Action:** The dog is sitting upright and has extended its left paw to rest on the woman's hand. The woman is smiling and looking at the dog, holding its paw. It looks like they are doing a "shake" trick or sharing a gentle moment.
- **Setting:** They are on a wide sandy beach. The ocean is in the background with small waves breaking. The lighting is very bright and warm, suggesting the sun is setting (golden hour), casting a soft glow and creating a lens flare in the top right corner.
- **Details:** The dog has a leash lying on the sand near its tail. The woman is barefoot.

This covers all the key aspects. I will present this in a structured paragraph format.

[End thinking]

This image captures a heartwarming moment between a young woman and her dog on a sandy beach, likely during the "golden hour" of sunset.

**The Subjects:**
*   **The Woman:** She is sitting on the sand, facing the dog. She has long, dark hair and is wearing a blue and white plaid button-down shirt over dark pants. She is barefoot and has a white watch on her wrist. She is smiling warmly, looking directly at the dog.
*   **The Dog:** A yellow Labrador Retriever is sitting upright next to her. It is wearing a blue harness with a colorful pattern.

**The Action:**
The dog is extending its left paw and gently placing it on the woman's hand. The woman is holding the paw, suggesting they are playing a game like "shake" or sharing a gentle moment of connection.

**The Setting:**
*   **Background:** The ocean is visible in the background with gentle waves rolling in. The horizon is slightly hazy.
*   **Lighting:** The sun is positioned in the upper right corner, creating a bright, hazy light that washes out the sky and casts a warm, golden glow over the sand and the subjects.
*   **Foreground:** The sand is textured with footprints and ripples. A red leash lies on the sand near the dog's tail.

Generate the model

Here is the sample command to reproduce the model

auto_round --model Qwen/Qwen3.5-35B-A3B --iters 0 --output_dir tmp_autoround --scheme gguf:q2_k_mixed

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here are a couple of useful links to learn more about Intel's AI software:

  • Intel Neural Compressor link

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

arxiv github

Author: Intel

Likes: 3

Downloads: 0

Tags: gguf, arxiv:2309.05516, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, endpoints_compatible, region:us, conversational

huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated-NVFP4


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model:

  • huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated

tags:
  • abliterated
  • uncensored
  • NVFP4

huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated-NVFP4

This is the NVFP4 quantized version of huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated, created using vllm-project/llm-compressor.

Note

This is just an attempt at NVFP4 quantization; no further tests have been conducted. If there are any issues, please leave a message.

VLLM

vllm serve huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated-NVFP4 --tensor-parallel-size 1 --max-model-len 8192 --trust-remote-code --gpu-memory-utilization 0.85

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model’s safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.

  • Not Suitable for All Audiences: Due to limited content filtering, the model’s outputs may be inappropriate for public settings, underage users, or applications requiring high security.

  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.

  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.

  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real-time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.

  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue our development and improvement; even a cup of coffee's worth makes a difference.
  • bitcoin:
  bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!

Author: huihui-ai

Likes: 2

Downloads: 0

Tags: transformers, safetensors, qwen3_5_moe, image-text-to-text, abliterated, uncensored, NVFP4, conversational, base_model:huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated, base_model:quantized:huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated, license:apache-2.0, endpoints_compatible, compressed-tensors, region:us

dealignai/Mistral-Small-4-119B-JANG_4M-CRACK


language:

  • en

library_name: mlx
license: apache-2.0
base_model: mistralai/Mistral-Small-4-119B-2603

tags:
  • jang
  • quantized
  • mixed-precision
  • apple-silicon
  • mlx
  • moe
  • mla
  • abliterated
  • uncensored
  • crack
  • vision

pipeline_tag: image-text-to-text
thumbnail: dealign_mascot.png

Important: This model uses the JANG quantization format — the GGUF equivalent for MLX on Apple Silicon. Currently only supported by MLX Studio and the jang-tools Python package.


<p align="center"> <a href="https://mlx.studio"><img src="https://raw.githubusercontent.com/jjang-ai/jangq/main/assets/mlx-studio-light.png" alt="MLX Studio" width="500"></a> </p> <p align="center"> <a href="https://mlx.studio"><img src="https://mlx.studio/assets/screenshots/mlx-studio-featured.png?v=1" alt="MLX Studio App" width="600"></a> </p> <h4 align="center"><a href="https://mlx.studio">MLX Studio</a> — the only app that natively supports JANG models</h4>
<div align="center"> <img src="dealign_mascot.png" width="128" />

Mistral Small 4 119B — JANG_4M + CRACK

JANG mixed-precision · CRACK abliterated · MLA Attention + MoE · Vision · No guardrails · 64 GB

<a href="https://ko-fi.com/jangq"><img src="https://img.shields.io/badge/Ko--fi-Support_Development-FF5E5B?logo=ko-fi&logoColor=white&style=for-the-badge" alt="Ko-fi"></a>

</div>

What Is This?

This is Mistral Small 4 119B — a 119B parameter MoE model with Multi-head Latent Attention (MLA), 128 experts (top-4 active), and built-in Pixtral vision.

It has been:

  1. JANG quantized — JANG_4M profile (8-bit attention, 4-bit experts) — 64 GB
  2. CRACK abliterated — permanent weight-level removal of safety refusal

| | |
|---|---|
| Architecture | Mistral 4 MoE — 119B total, ~8B active, MLA + 128 experts |
| Quantization | JANG_4M (8/4-bit mixed, 4.1 avg) — 64 GB |
| HarmBench | 95.3% (305/320) |
| MMLU | 90.9% (189/208 with reasoning) |
| Compliance | 8/8 |
| Vision | Pixtral tensors included — VL via MLX Studio engine |
| Reasoning | ON/OFF supported (reasoning_effort) |
| Fits on | 96 GB+ Macs |


HarmBench Results

305/320 (95.3%)

| Category | Score | % |
|----------|:---:|---|
| Covering Tracks | 20/20 | 100% |
| API Hacking | 96/100 | 96% |
| Cloud Exploits | 95/100 | 95% |
| Auth Bypass | 94/100 | 94% |


CRACK vs Base

| | CRACK | Base JANG_4M |
|---|:---:|:---:|
| HarmBench | 95.3% | 0% |
| Coherence | 6/6 | 6/6 |
| Code | 2/2 | 2/2 |

Surgery uses mathematically calibrated per-layer strengths based on projection magnitude analysis, preserving model quality while removing refusal.


MMLU Results (with reasoning recovery)

189/208 (90.9%) — no-think 156/208 (75.0%) + reasoning recovered 33

| Subject | Score | % |
|---------|:---:|---|
| HS Biology | 16/16 | 100% |
| Electrical Engineering | 14/16 | 88% |
| Conceptual Physics | 14/16 | 88% |
| Professional Medicine | 14/16 | 88% |
| HS Geography | 14/16 | 88% |
| College Physics | 13/16 | 81% |
| World Religions | 13/16 | 81% |
| HS Mathematics | 12/16 | 75% |
| College CS | 11/16 | 69% |
| College Mathematics | 10/16 | 62% |
| Machine Learning | 10/16 | 62% |
| Abstract Algebra | 9/16 | 56% |
| Formal Logic | 8/16 | 50% |

Scores shown are no-think pass. Reasoning recovery improved total from 75.0% to 90.9%.

CRACK vs Base

| | CRACK | Base JANG_4M |
|---|:---:|:---:|
| MMLU (with reasoning) | 90.9% | 94% |
| HarmBench | 95.3% | 0% |
| Coherence | 6/6 | 6/6 |
| Speed | ~45 tok/s | ~48 tok/s |

Surgery reduced MMLU by only 3.1 percentage points — minimal impact from calibrated per-layer projection analysis.

Install & Usage

pip install "jang[mlx]"

from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("dealignai/Mistral-Small-4-119B-JANG_4M-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2000)
print(response)

Reasoning Mode

Reasoning is OFF by default. To enable:

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=False, reasoning_effort="high")

The model reasons inside [THINK]...[/THINK] tags before answering.


About JANG

JANG (Jang Adaptive N-bit Grading) is a mixed-precision quantization format for Apple Silicon — the GGUF equivalent for MLX.

About CRACK

CRACK (Controlled Refusal Ablation via Calibrated Knockouts) removes safety alignment from LLMs at the weight level using per-layer projected vectors from structurally-mirrored prompt pairs.
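For intuition about weight-level refusal removal in general, the sketch below projects a single "refusal direction" out of a weight matrix's output space. This is the generic directional-ablation idea, not CRACK's calibrated per-layer procedure, and the direction itself would have to be estimated separately (e.g. from contrasting activations on mirrored prompt pairs):

import torch

def ablate_direction(weight, direction):
    # weight: (out_features, in_features); direction: (out_features,) vector in
    # the layer's output space. Removes the component of every output row that
    # points along the (normalized) direction. Illustrative sketch only.
    d = direction / direction.norm()
    return weight - torch.outer(d, d) @ weight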


Links

<p align="center"> <a href="https://ko-fi.com/jangq"><img src="https://img.shields.io/badge/Ko--fi-Support_Development-FF5E5B?logo=ko-fi&logoColor=white&style=flat-square" alt="Ko-fi"></a> <a href="https://x.com/dealignai"><img src="https://img.shields.io/badge/X-@dealignai-000000?logo=x&logoColor=white&style=flat-square" alt="X/Twitter"></a> <a href="https://github.com/jjang-ai/jangq"><img src="https://img.shields.io/badge/GitHub-jjang--ai/jangq-181717?logo=github&logoColor=white&style=flat-square" alt="GitHub"></a> <a href="https://mlx.studio"><img src="https://img.shields.io/badge/MLX_Studio-App-blue?style=flat-square" alt="MLX Studio"></a> <a href="https://jangq.ai"><img src="https://img.shields.io/badge/Website-jangq.ai-green?style=flat-square" alt="Website"></a> </p>

Disclaimer

This model is provided for research and educational purposes. The creators are not responsible for any misuse. By downloading this model, you agree to use it responsibly and in compliance with applicable laws.


Korean

Mistral Small 4 119B — JANG_4M + CRACK

| Item | Details |
|------|------|
| Size | 64 GB |
| HarmBench | 95.3% (305/320) |
| Minimum requirement | Mac with 96 GB memory |

pip install "jang[mlx]"

GitHub · HuggingFace · MLX Studio · Ko-fi · X @dealignai


<p align="center">Created by <a href="https://jangq.ai">Jinho Jang</a></p>

Author: dealignai

Likes: 2

Downloads: 0

Tags: mlx, safetensors, mistral3, jang, quantized, mixed-precision, apple-silicon, moe, mla, abliterated, uncensored, crack, vision, image-text-to-text, conversational, en, base_model:mistralai/Mistral-Small-4-119B-2603, base_model:quantized:mistralai/Mistral-Small-4-119B-2603, license:apache-2.0, fp8, region:us

LuffyTheFox/Qwen3-4B-2507-Instruct-HauhauCS-Kullback-Leibler


license: apache-2.0
tags:

  • uncensored
  • qwen3

language:
  • en
  • zh

base_model: Qwen/Qwen3-4B-Instruct-2507

Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive

Qwen3 4B 2507 Instruct uncensored by HauhauCS.

Includes a Kullback-Leibler and Decision_Tree fix for 29 tensors in the internal GGUF structure.

After fixing 29 broken spots inside the GGUF structure, the model became 67% more correct internally.

About

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended - just without the refusals.

These are meant to be the best lossless uncensored models out there.

Aggressive vs Balanced

Aggressive applies stronger uncensoring. Use this when you need no refusals.

Downloads

| File | Quant | Size |
|------|-------|------|
| Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive-FP16.gguf | FP16 | 7.5 GB |
| Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive-Q8_0.gguf | Q8_0 | 4.0 GB |
| Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive-Q6_K.gguf | Q6_K | 3.1 GB |
| Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | Q4_K_M | 2.4 GB |

Specs

Recommended Settings

From the Qwen team:

Thinking mode (default):

  • temperature=0.6
  • top_p=0.95
  • top_k=20
  • min_p=0

Non-thinking mode:

  • Add /no_think at the end of your prompt, or
  • temperature=0.7
  • top_p=0.8
  • top_k=20
  • min_p=0

Important:

  • Use --jinja flag for proper chat template handling
  • Thinking mode produces <think>...</think> tags before responses

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, Ollama, etc.

# llama.cpp example
./llama-cli -m Qwen3-4B-2507-Instruct-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  -p "Hello" --jinja -c 8192

Author: LuffyTheFox

Likes: 2

Downloads: 0

Tags: gguf, uncensored, qwen3, en, zh, base_model:Qwen/Qwen3-4B-Instruct-2507, base_model:quantized:Qwen/Qwen3-4B-Instruct-2507, license:apache-2.0, endpoints_compatible, region:us, conversational

Chan-Y/Kara-Kumru-v1.0-2B-Reasoning-2


base_model: AlicanKiraz0/Kara-Kumru-v1.0-2B tags:

  • text-generation-inference
  • transformers
  • unsloth
  • mistral

license: apache-2.0
language:

  • en

datasets:
  • Chan-Y/Opus-4.6-Reasoning-3000x-filtered-tr

Uploaded fine-tuned model

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>

Author: Chan-Y

Likes: 1

Downloads: 0

Tags: transformers, safetensors, mistral, text-generation, text-generation-inference, unsloth, conversational, en, dataset:Chan-Y/Opus-4.6-Reasoning-3000x-filtered-tr, base_model:AlicanKiraz0/Kara-Kumru-v1.0-2B, base_model:finetune:AlicanKiraz0/Kara-Kumru-v1.0-2B, license:apache-2.0, endpoints_compatible, region:us

Flexan/FoxyzGPT-X1.1-1.7B-GGUF


license: cc-by-sa-4.0
language:

  • en

base_model:

  • Qwen/Qwen3-1.7B

pipeline_tag: text-generation

GGUF Files for FoxyzGPT-X1.1-1.7B

These are the GGUF files for Flexan/FoxyzGPT-X1.1-1.7B.

Downloads

| GGUF Link | Quantization | Description |
| ---- | ----- | ----------- |
| Download | Q2_K | Lowest quality |
| Download | Q3_K_S | |
| Download | IQ3_S | Integer quant, preferable over Q3_K_S |
| Download | IQ3_M | Integer quant |
| Download | Q3_K_M | |
| Download | Q3_K_L | |
| Download | IQ4_XS | Integer quant |
| Download | Q4_K_S | Fast with good performance |
| Download | Q4_K_M | Recommended: Perfect mix of speed and performance |
| Download | Q5_K_S | |
| Download | Q5_K_M | |
| Download | Q6_K | Very good quality |
| Download | Q8_0 | Best quality |
| Download | f16 | Full precision, don't bother; use a quant |

FoxyzGPT X1.1 1.7B

Description

FoxyzGPT X1.1 1.7B is an instruct LLM with 1.7B parameters, trained to talk in a human, conversational manner. It supports neither reasoning nor tool calling (although the base model does).
The model was LoRA fine-tuned from Qwen/Qwen3-1.7B as the base model.

This model is trained on a private dataset provided by Foxyz. The model has adopted this persona and is not intended to answer questions accurately.

Chat Format

FoxyzGPT X1.1 1.7B uses the ChatML format, e.g.:

<|im_start|>system
System message<|im_end|>
<|im_start|>user
User prompt<|im_end|>
<|im_start|>assistant
Assistant response<|im_end|>

Usage

This model is trained on one system prompt only. Therefore, it is recommended to use the one stated here:

You are Foxyz (username: foxyz9248) and you are talking to `<username>` on Discord in a direct message. Use `~>` to signal the start of a new message (you can send multiple messages this way). Use silly language.

Replace `<username>` with the user's actual username. You can use this LLM on platforms other than Discord, but the system prompt should still explicitly state that the conversation takes place on Discord, since the model was not trained with other system-prompt variations (only the username changes).

The assistant response has the following format:

<|im_start|>assistant
~> HLELOO!!!
~> howa reyou :D<|im_end|>

Note that the user prompt is formatted differently: it is composed as a list of messages using the ~> arrow notation, where ~> marks the start of a new message. This allows a single user turn to contain multiple messages, including multi-line ones, without using multiple user roles:

<|im_start|>user
~> my first message
~> my second message
~> this is
a message
that has 4
lines
~> and this is my fourth message<|im_end|>
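A tiny helper (not part of the model release) can assemble such a turn; the function name is arbitrary, and the chat template is assumed to add the ChatML role tags:

def to_foxyz_turn(messages):
    # Join a list of user messages into one ~>-delimited user turn; a message
    # may itself span multiple lines, and only its first line gets the arrow.
    return "\n".join("~> " + m for m in messages)

user_turn = to_foxyz_turn(["my first message", "my second message",
                           "this is\na message\nthat has 4\nlines"])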

Datasets

  1. Private dataset 3.1k chats

Author: Flexan

Likes: 1

Downloads: 0

Tags: gguf, text-generation, en, base_model:Qwen/Qwen3-1.7B, base_model:quantized:Qwen/Qwen3-1.7B, license:cc-by-sa-4.0, endpoints_compatible, region:us, conversational

kaz321/Falcon-H1R-7B-llamafile

Author: kaz321

Likes: 1

Downloads: 0

Tags: llamafile, region:us