microsoft/harrier-oss-v1-0.6b
tags:
- mteb
- sentence-transformers
- transformers
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
harrier-oss-v1
harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft.
The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings.
They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking.
The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.
| Model | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|-----------------------------------------------------------------------------|------------|---------------------|------------|---------------|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |
Training
All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks.
The 270M and 0.6B variants are additionally trained with knowledge distillation from larger embedding models.
Usage
Below is an example of encoding queries and passages from the MS MARCO passage ranking dataset.
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("microsoft/harrier-oss-v1-0.6b", model_kwargs={"dtype": "auto"})
queries = [
"how much protein should a female eat",
"summit define",
]
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)
scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").
Transformers
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
if left_padding:
return last_hidden_states[:, -1]
else:
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery: {query}'
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'how much protein should a female eat'),
get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-0.6b')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-0.6b', dtype='auto')
model.eval()
model.cuda()
max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
Supported Languages
The models are trained on multilingual data and support a wide range of languages,
including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish,
Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese,
Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian,
Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.
Evaluation
Please refer to the mteb repository for instructions on reproducing our scores.
The evaluation prompts used for each task are also available at mteb_v2_eval_prompts.json.
FAQ
1. Do I need to add instructions to the query?
Yes. This is how the model was trained; omitting the instruction will cause a performance degradation.
The task definition should be a one-sentence instruction that describes the task.
This is a way to customize text embeddings for different scenarios through natural language instructions.
On the other hand, there is no need to add instructions to the document side.
2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.
3. What pooling strategy does this model use?
The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation.
The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.
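For intuition, the pooling and normalization described above can be sketched on a toy batch (the values are random; this mirrors the behavior of the last_token_pool helper in the Transformers example for the right-padded case):

```python
import torch
import torch.nn.functional as F

# Toy batch: 2 sequences, hidden size 4, right-padded to length 3.
# attention_mask marks real tokens; the second sequence has only 2 tokens.
hidden = torch.randn(2, 3, 4)
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])

# Last-token pooling: take the hidden state of the last non-padding token.
lengths = mask.sum(dim=1) - 1          # index of last real token per sequence
pooled = hidden[torch.arange(2), lengths]

# L2-normalize so that dot products between embeddings equal cosine similarities.
emb = F.normalize(pooled, p=2, dim=1)
print(emb.norm(dim=1))  # each row has unit norm
```

Because the embeddings are unit-length, the `query @ documents.T` products in the usage examples above are cosine similarities.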
Author: microsoft
Likes: 75
Downloads: 0
Tags: sentence-transformers, safetensors, qwen3, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us
microsoft/harrier-oss-v1-270m
tags:
- mteb
- sentence-transformers
- transformers
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
harrier-oss-v1
harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft.
The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings.
They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking.
The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.
| Model | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|-----------------------------------------------------------------------------|------------|---------------------|------------|---------------|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |
Training
All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks.
The 270M and 0.6B variants are additionally trained with knowledge distillation from larger embedding models.
Usage
Below is an example of encoding queries and passages from the MS MARCO passage ranking dataset.
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("microsoft/harrier-oss-v1-270m", model_kwargs={"dtype": "auto"})
queries = [
"how much protein should a female eat",
"summit define",
]
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)
scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").
Transformers
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
if left_padding:
return last_hidden_states[:, -1]
else:
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery: {query}'
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'how much protein should a female eat'),
get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-270m')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-270m', dtype='auto')
model.eval()
model.cuda()
max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
Supported Languages
The models are trained on multilingual data and support a wide range of languages,
including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish,
Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese,
Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian,
Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.
Evaluation
Please refer to the mteb repository for instructions on reproducing our scores.
The evaluation prompts used for each task are also available at mteb_v2_eval_prompts.json.
FAQ
1. Do I need to add instructions to the query?
Yes. This is how the model was trained; omitting the instruction will cause a performance degradation.
The task definition should be a one-sentence instruction that describes the task.
This is a way to customize text embeddings for different scenarios through natural language instructions.
On the other hand, there is no need to add instructions to the document side.
2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.
3. What pooling strategy does this model use?
The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation.
The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.
Author: microsoft
Likes: 43
Downloads: 0
Tags: sentence-transformers, safetensors, gemma3_text, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us
microsoft/harrier-oss-v1-27b
tags:
- mteb
- sentence-transformers
- transformers
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
harrier-oss-v1
harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft.
The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings.
They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking.
The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.
| Model | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|-----------------------------------------------------------------------------|------------|---------------------|------------|---------------|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |
Training
All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks.
The 270M and 0.6B variants are additionally trained with knowledge distillation from larger embedding models.
Usage
Below is an example of encoding queries and passages from the MS MARCO passage ranking dataset.
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("microsoft/harrier-oss-v1-27b", model_kwargs={"dtype": "auto"})
queries = [
"how much protein should a female eat",
"summit define",
]
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)
scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").
Transformers
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
if left_padding:
return last_hidden_states[:, -1]
else:
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = last_hidden_states.shape[0]
return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
def get_detailed_instruct(task_description: str, query: str) -> str:
return f'Instruct: {task_description}\nQuery: {query}'
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
get_detailed_instruct(task, 'how much protein should a female eat'),
get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-27b')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-27b', dtype='auto')
model.eval()
model.cuda()
max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
Supported Languages
The models are trained on multilingual data and support a wide range of languages,
including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish,
Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese,
Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian,
Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.
Evaluation
Please refer to the mteb repository for instructions on reproducing our scores.
The evaluation prompts used for each task are also available at mteb_v2_eval_prompts.json.
FAQ
1. Do I need to add instructions to the query?
Yes. This is how the model was trained; omitting the instruction will cause a performance degradation.
The task definition should be a one-sentence instruction that describes the task.
This is a way to customize text embeddings for different scenarios through natural language instructions.
On the other hand, there is no need to add instructions to the document side.
2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.
3. What pooling strategy does this model use?
The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation.
The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.
Author: microsoft
Likes: 40
Downloads: 0
Tags: sentence-transformers, safetensors, gemma3_text, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us
meituan-longcat/LongCat-AudioDiT-3.5B
license: mit
language:
- zh
- en
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
<div align="center">
<img src="./LongCat-AudioDiT.svg" width="45%" alt="LongCat-AudioDiT" />
</div>
<hr>
<div align="center" style="line-height: 1;">
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf">
<img alt="License" src="https://img.shields.io/badge/Paper-LongCatAudioDiT-blue" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-LongCatAudioDiT-white?logo=github&logoColor=white&color=a4b5d5" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatAudioDiT3.5B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatAudioDiT1B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/assets/wechat_official_accounts.png" target="_blank" style="margin: 2px;">
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://x.com/Meituan_LongCat" target="_blank" style="margin: 2px;">
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
Introduction
LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that operates directly in the waveform latent space.
Abstract: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance.
Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone.
Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality.
Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility.
Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard.
Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules.
Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance.
Code and model weights are released to foster further research within the speech community.
<div align="center">
<img src="./architecture.png" width="75%" alt="LongCat-AudioDiT" />
</div>
This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.
Experimental Results on Seed Benchmark
LongCat-AudioDiT achieves state-of-the-art (SOTA) voice cloning performance on the Seed benchmark, surpassing both closed-source and open-source models.
| Model | ZH CER (%) ↓ | ZH SIM ↑ | EN WER (%) ↓ | EN SIM ↑ | ZH-Hard CER (%) ↓ | ZH-Hard SIM ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | 0.790 | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | 5.83 | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | 0.93 | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | 1.23 | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | 0.87 | 0.797 | 1.57 | 0.738 | 5.71 | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | 0.812 | 1.78 | 0.762 | 6.33 | 0.787 |
| LongCat-AudioDiT-3.5B | 1.09 | 0.818 | 1.50 | 0.786 | 6.04 | 0.797 |
Installation
pip install -r requirements.txt
CLI Inference
# TTS
python inference.py --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" --output_audio output.wav --model_dir meituan-longcat/LongCat-AudioDiT-1B
# Voice cloning
python inference.py \
--text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
--prompt_text "小偷却一点也不气馁,继续在抽屉里翻找。" \
--prompt_audio assets/prompt.wav \
--output_audio output.wav \
--model_dir meituan-longcat/LongCat-AudioDiT-1B \
--guidance_method apg
# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
--lst /path/to/meta.lst \
--output_dir /path/to/output \
--model_dir meituan-longcat/LongCat-AudioDiT-1B \
--guidance_method apg
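Following the format noted in the comment above, a minimal meta.lst could look like the following (the file path and uid are placeholders; the texts reuse the examples from this card):

```shell
# One item per line, pipe-separated: uid|prompt_text|prompt_wav_path|gen_text
cat > /tmp/meta.lst <<'EOF'
utt001|小偷却一点也不气馁,继续在抽屉里翻找。|assets/prompt.wav|今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。
EOF
awk -F'|' '{print NF}' /tmp/meta.lst   # each line should yield 4 fields
```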
Inference (Python API)
1. TTS
import audiodit # auto-registers with transformers
from audiodit import AudioDiTModel
from transformers import AutoTokenizer
import torch, soundfile as sf
# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half() # VAE runs in fp16 (matching original)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)
# Zero-shot synthesis
inputs = tokenizer(["今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"], padding="longest", return_tensors="pt")
output = model(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
duration=62, # latent frames
steps=16,
cfg_strength=4.0,
guidance_method="cfg", # or "apg"
seed=1024,
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)
2. Voice Cloning (with prompt audio)
import librosa, torch
# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0) # (1, 1, T)
# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁,继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")
output = model(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
prompt_audio=prompt_wav,
duration=138, # prompt_frames + gen_frames
steps=16,
cfg_strength=4.0,
guidance_method="apg",
seed=1024,
)
# Save the cloned speech; the output length covers prompt + generated frames.
sf.write("cloned.wav", output.waveform.squeeze().cpu().numpy(), 24000)
License Agreement
This repository, including both the model weights and the source code, is released under the MIT License.
Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.
For details, see the LICENSE file.
Author: meituan-longcat
Likes: 21
Downloads: 0
Tags: safetensors, audiodit, zh, en, license:mit, region:us
meituan-longcat/LongCat-AudioDiT-1B
license: mit
language:
- zh
- en
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
<div align="center">
<img src="./LongCat-AudioDiT.svg" width="45%" alt="LongCat-AudioDiT" />
</div>
<hr>
<div align="center" style="line-height: 1;">
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf">
<img alt="License" src="https://img.shields.io/badge/Paper-LongCatAudioDiT-blue" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-LongCatAudioDiT-white?logo=github&logoColor=white&color=a4b5d5" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatAudioDiT3.5B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCatAudioDiT1B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/assets/wechat_official_accounts.png" target="_blank" style="margin: 2px;">
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://x.com/Meituan_LongCat" target="_blank" style="margin: 2px;">
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
Introduction
LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that directly operates on the waveform latent space.
Abstract: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance.
Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone.
Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality.
Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility.
Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard.
Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules.
Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance.
Code and model weights are released to foster further research within the speech community.
<div align="center">
<img src="./architecture.png" width="75%" alt="LongCat-AudioDiT" />
</div>
This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.
Experimental Results on Seed Benchmark
LongCat-AudioDiT obtains state-of-the-art (SOTA) voice cloning performance on the Seed benchmark, surpassing both closed-source and open-source models.
| Model | ZH CER (%) ↓ | ZH SIM ↑ | EN WER (%) ↓ | EN SIM ↑ | ZH-Hard CER (%) ↓ | ZH-Hard SIM ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | 0.790 | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | 5.83 | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | 0.93 | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | 1.23 | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | 0.87 | 0.797 | 1.57 | 0.738 | 5.71 | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | 0.812 | 1.78 | 0.762 | 6.33 | 0.787 |
| LongCat-AudioDiT-3.5B | 1.09 | 0.818 | 1.50 | 0.786 | 6.04 | 0.797 |
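As a sanity check, the SIM gains cited in the abstract can be recomputed from the table, assuming the previous-best rows are Seed-DiT for Seed-ZH and Seed-ICL for Seed-Hard:

```python
# SIM scores copied from the benchmark table above.
prev_best = {"Seed-ZH": 0.809, "Seed-Hard": 0.776}     # Seed-DiT / Seed-ICL
longcat_3_5b = {"Seed-ZH": 0.818, "Seed-Hard": 0.797}  # LongCat-AudioDiT-3.5B

for bench, prev in prev_best.items():
    delta = longcat_3_5b[bench] - prev
    print(f"{bench}: {prev:.3f} -> {longcat_3_5b[bench]:.3f} (+{delta:.3f})")
```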
Installation
pip install -r requirements.txt
CLI Inference
# TTS
python inference.py --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" --output_audio output.wav --model_dir meituan-longcat/LongCat-AudioDiT-1B
# Voice cloning
python inference.py \
--text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
--prompt_text "小偷却一点也不气馁,继续在抽屉里翻找。" \
--prompt_audio assets/prompt.wav \
--output_audio output.wav \
--model_dir meituan-longcat/LongCat-AudioDiT-1B \
--guidance_method apg
# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
--lst /path/to/meta.lst \
--output_dir /path/to/output \
--model_dir meituan-longcat/LongCat-AudioDiT-1B \
--guidance_method apg
Inference (Python API)
1. TTS
import audiodit # auto-registers with transformers
from audiodit import AudioDiTModel
from transformers import AutoTokenizer
import torch, soundfile as sf
# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half() # VAE runs in fp16 (matching original)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)
# Zero-shot synthesis
inputs = tokenizer(["今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"], padding="longest", return_tensors="pt")
output = model(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
duration=62, # latent frames
steps=16,
cfg_strength=4.0,
guidance_method="cfg", # or "apg"
seed=1024,
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)
2. Voice Cloning (with prompt audio)
import librosa, torch
# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0) # (1, 1, T)
# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁,继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")
output = model(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
prompt_audio=prompt_wav,
duration=138, # prompt_frames + gen_frames
steps=16,
cfg_strength=4.0,
guidance_method="apg",
seed=1024,
)
License Agreement
This repository, including both the model weights and the source code, is released under the MIT License.
Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.
For details, see the LICENSE file.
Author: meituan-longcat
Likes: 18
Downloads: 0
Tags: safetensors, audiodit, zh, en, license:mit, region:us
bosonai/higgs-audio-v3-8b-stt
license: apache-2.0
language:
- en
tags:
- automatic-speech-recognition
- whisper
- qwen
pipeline_tag: automatic-speech-recognition
Higgs Audio v3 8B STT
A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.
Usage
Important: This model uses a custom architecture. You must pass trust_remote_code=True when loading.
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(
"bosonai/higgs-audio-v3-8b-stt",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
attn_implementation="eager",
device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt")
Requirements
torch
transformers>=4.51.0
boson_multimodal # for audio preprocessing
Architecture
- Encoder: Whisper-Large-v3 (frozen)
- Decoder: Qwen3-8B (LoRA fine-tuned, merged)
- Total parameters: 8.91B
- Audio input: 16kHz mono WAV
- Supports: Thinking mode for improved accuracy
Performance (ESB Benchmark — Full Scale)
| Dataset | WER |
|---------|-----|
| AMI | 9.85% |
| Earnings22 | 9.01% |
| GigaSpeech | 8.54% |
| LibriSpeech Clean | 1.28% |
| LibriSpeech Other | 2.41% |
| SPGISpeech | 3.53% |
| TED-LIUM | 2.74% |
| VoxPopuli | 6.07% |
| Average | 5.43% |
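The Average row is the unweighted mean over the eight datasets; a quick recomputation:

```python
# ESB WER values (%) taken from the table above.
wer = {
    "AMI": 9.85, "Earnings22": 9.01, "GigaSpeech": 8.54,
    "LibriSpeech Clean": 1.28, "LibriSpeech Other": 2.41,
    "SPGISpeech": 3.53, "TED-LIUM": 2.74, "VoxPopuli": 6.07,
}
avg = sum(wer.values()) / len(wer)
print(f"macro-average WER: {avg:.2f}%")  # 5.43%
```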
Author: bosonai
Likes: 4
Downloads: 0
Tags: safetensors, higgs_audio_3, automatic-speech-recognition, whisper, qwen, custom_code, en, license:apache-2.0, region:us
SciMaker/T-qwen3.5-4B
A retraining of the qwen3.5 model that fixes two issues: when conversing in Traditional Chinese, it almost never slips into Simplified Chinese; and when answering Taiwan-related questions, it now takes a normal, world-centric perspective rather than a China-centric one. The series is named T-qwen, where the T stands for Taiwan and Traditional Chinese.
Author: SciMaker
Likes: 4
Downloads: 0
Tags: gguf, endpoints_compatible, region:us, conversational
YTan2000/Qwen3.5-27B-TQ3_1S
license: mit
language:
- en
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- qwen
- qwen3.5
- quantization
- turboquant
- wht
base_model:
- Qwen/Qwen3.5-27B
Qwen3.5-27B-TQ3_1S
Qwen3.5-27B-TQ3_1S is a GGUF quantization of Qwen/Qwen3.5-27B using TQ3_1S, a 3.5-bit weight format based on:
- Walsh-Hadamard rotation
- 8 centroid quantization
- dual half-block scales
This release is aimed at one practical outcome:
- near-Q4_0 quality
- about 10% smaller than Q4_0
- small enough to fit fully on a single 16 GB RTX 5060 Ti in the tested llama.cpp setup
Headline Result
Gold-standard wiki.test.raw pass, c=512, full 580 chunks:
| Format | PPL | Size |
|---|---:|---:|
| Q4_0 | 7.2431 +/- 0.0482 | 14.4 GB |
| TQ3_1S | 7.2570 +/- 0.0480 | 12.9 GB |
Measured gap: +0.0139 PPL (7.2570 − 7.2431), about 0.2% and well within the reported ±0.048 error bars.
Safe interpretation:
- TQ3_1S is near-Q4_0 quality
- TQ3_1S is about 1.5 GB smaller
- on this 27B model, that size reduction is enough to change deployment on a 16 GB GPU
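The gap can be recomputed directly from the table; combining the two reported uncertainties in quadrature is an assumed (though standard) treatment:

```python
# Perplexity figures from the table above (wiki.test.raw, c=512, 580 chunks).
ppl_q4_0, err_q4_0 = 7.2431, 0.0482
ppl_tq3, err_tq3 = 7.2570, 0.0480

gap = ppl_tq3 - ppl_q4_0                           # +0.0139 PPL
rel = 100 * gap / ppl_q4_0                         # ~0.19%
combined_err = (err_q4_0**2 + err_tq3**2) ** 0.5   # errors added in quadrature

# The gap sits well inside the combined error, i.e. statistically indistinguishable.
print(f"gap: +{gap:.4f} PPL ({rel:.2f}%), combined error ±{combined_err:.4f}")
```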
Important Caveat
This model card does not claim that TQ3_1S is universally faster than native Q4_0 under the same conditions.
The practical speed win in the tested setup comes mainly from TQ3_1S fitting fully on GPU, while Q4_0 does not fit fully on GPU on the same 16 GB card.
So this is primarily a deployment / fit advantage story, not a blanket kernel-speed claim.
Files
Base Model
Recommended Runtime
This model is intended for the public TQ3 runtime fork:
- GitHub:
https://github.com/turbo-tan/llama.cpp-tq3
It requires TQ3_1S runtime support and will not run on a stock llama.cpp build unless that support is present.
Example
./build/bin/llama-server \
-m Qwen3.5-27B-TQ3_1S.gguf \
-ngl 99 \
-fa on \
-c 4096
Quantization Notes
TQ3_1S uses a 32-element block layout:
[d0: fp16][d1: fp16][qs: 12 bytes]
That is:
- 16 bytes per 32 weights
- 4.0 bits per weight at the block level
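The block arithmetic above checks out directly; note that the 12-byte payload alone works out to 3 bits per weight, consistent with the 8-centroid codebook, with the two fp16 scales adding the remaining overhead:

```python
import math

# TQ3_1S 32-element block: [d0: fp16][d1: fp16][qs: 12 bytes]
scales_bytes = 2 + 2      # two fp16 half-block scales
payload_bytes = 12        # packed centroid indices
weights_per_block = 32

payload_bits = payload_bytes * 8 / weights_per_block               # 3.0
total_bits = (scales_bytes + payload_bytes) * 8 / weights_per_block  # 4.0

assert payload_bits == math.log2(8)  # 8 centroids -> 3 index bits per weight
print(payload_bits, total_bits)  # 3.0 4.0
```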
Credit
This work is inspired by the broader line of transform-based quantization methods, especially RaBitQ-style Walsh-Hadamard rotation ideas, adapted here for LLM weight quantization in GGUF / llama.cpp.
Limitations
- This 27B result does not imply that plain TQ3_1S is equally strong on smaller dense models.
- In internal testing, 9B models were much less forgiving at this bitrate.
- This release is a practical 27B deployment artifact, not a universal claim about all model scales.
License
Same model license terms as the base model apply.
Author: YTan2000
Likes: 3
Downloads: 0
Tags: gguf, llama.cpp, qwen, qwen3.5, quantization, turboquant, wht, text-generation, en, base_model:Qwen/Qwen3.5-27B, base_model:quantized:Qwen/Qwen3.5-27B, license:mit, endpoints_compatible, region:us, imatrix, conversational
QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- vLLM
- AWQ
base_model:
- Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
base_model_relation: quantized
Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ
Base model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
This repo quantizes the model using data-free quantization (no calibration dataset required).
【Dependencies / Installation】
vllm>=0.18.0
transformers>=5.3.0.dev0
As of 2026-03-30, make sure your system has CUDA 12.8 installed.
Then, create a fresh Python environment (e.g. python3.12 venv) and run:
pip install vllm==0.18.0
# upgrade transformers so that applications can properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
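The perl one-liner above replaces line 651 in place while keeping a `.bak` backup. For readers who prefer Python, here is an equivalent generic sketch; the `replace_line` helper is hypothetical, not part of any library:

```python
import shutil

def replace_line(path: str, lineno: int, new_line: str) -> None:
    """Replace the 1-indexed line `lineno` of `path`, keeping a .bak backup
    (mirroring perl -i.bak)."""
    shutil.copy(path, path + ".bak")
    with open(path) as f:
        lines = f.readlines()
    lines[lineno - 1] = new_line + "\n"
    with open(path, "w") as f:
        f.writelines(lines)

# Usage (with TF_FILE and the new line content from the shell snippet above):
# replace_line(tf_file, 651, new_line)
```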
vLLM Official Guide
【vLLM Startup Command】
export OMP_NUM_THREADS=4
vllm serve \
__YOUR_PATH__/QuantTrio/Qwen3.5-27B-AWQ \
--served-model-name MY_MODEL \
--max-num-seqs 32 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
【Logs】
2026-03-30
1. Initial commit
【Model Files】
| File Size | Last Updated |
|-----------|--------------|
| 21GiB | 2026-03-30 |
【Model Download】
from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', cache_dir="your_local_path")
【Overview】
🌟 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
📢 Announcement
v2 Update:
- Accuracy preserved: matches the base model on HumanEval (96.91% pass@1)
- Shorter reasoning: ~24% reduction in chain-of-thought length
- Higher efficiency: +31.6% more correct solutions per token
- ⚠️ Trade-off: −1.24% on HumanEval+ and −7.2% on MMLU-Pro (indicating reduced general-knowledge reasoning performance)
⚠️Note: Due to the scope of SFT data and training focus, the model may underperform the base model on certain tasks requiring long-context understanding or more complex multi-step reasoning. The efficiency and accuracy results reported here are based solely on the HumanEval and HumanEval+ benchmarks. Thank you for your understanding.
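The headline efficiency number is consistent with the length reduction alone: if accuracy is unchanged while chains shrink by ~24%, correct solutions per token scale by 1/(1 − 0.24) ≈ 1.316. This is a plausible derivation, not necessarily the card's exact methodology:

```python
# If accuracy is held constant while chain-of-thought length drops ~24%,
# correct solutions per token scale as 1 / (1 - 0.24).
length_reduction = 0.24
efficiency_gain = 1 / (1 - length_reduction) - 1
print(f"+{efficiency_gain:.1%}")  # +31.6%
```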

💡 Model Introduction
Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 is the second iteration of this reasoning-focused Qwen3.5-27B fine-tune, built to make chain-of-thought generation substantially more efficient, cutting reasoning cost and latency while increasing absolute accuracy.
Compared with the earlier version, v2 was trained on 14,000 Claude 4.6 Opus-style general reasoning samples, with a stronger emphasis on transferring concise, reusable reasoning patterns rather than only maximizing raw benchmark scores. The goal of v2 is not simply to make the model "think more," but to help it think more economically: reducing unnecessarily long internal chains, avoiding verbose over-analysis of easy problems, and improving the reasoning-cost-to-quality ratio while exceeding the baseline's benchmark correctness.
A key design choice in v2 is that the distillation data is primarily general-domain reasoning data—specifically focused on mathematics, word problems, logical deduction, and a balanced mix of general knowledge and instructions—rather than specialized code-heavy supervision. Consequently, HumanEval and HumanEval+ are employed here to evaluate cross-task generalization and capability transfer, rather than serving as direct optimization targets. High performance on these benchmarks, despite the lack of code-centric training, confirms that the model's reasoning scaffold has become more robust and transferable, proving that fundamental reasoning logic can effectively power specialized tasks like programming.
HumanEval Benchmark Analysis 🪐
The raw evaluation outputs for both models were independently cleaned, verified, and aggregated using GPT-5.4-Pro-Thinking. The final comparative results are based on these standardized and curated outputs. To ensure reliability, all results were further cross-checked and consolidated through two rounds of independent validation using Claude-4.6-Opus-Thinking.
All evaluations were conducted in an inference environment based on Unsloth + vLLM (BF16) to ensure consistent and efficient execution conditions.






🗺️ Training Pipeline Overview
Base Model (Qwen3.5-27B)
│
▼
Qwen3.5-27B fine-tuned with Unsloth
│
▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
│
▼
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
🧠 Example of a Learned Reasoning Scaffold
The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern:
“Let me analyze this request carefully: 1..2..3...”.
This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
...
📚 All Datasets Used
The dataset consists of high-quality, filtered reasoning distillation data:
| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Provides comprehensive Claude 4.6 Opus reasoning trajectories. |
| Roman1111111/claude-opus-4.6-10000x | Large-scale public Claude 4.6 Opus distillation data used to strengthen general reasoning transfer in v2. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injecting high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples designed to strengthen structured step-by-step problem solving and improve reasoning diversity. |
⚠️ Limitations & Intended Use
- Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; factual statements produced during the thinking sequence may occasionally be hallucinated, particularly when they concern real-world events.
- Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
- This model is a test version intended solely for learning and demonstration purposes, and is for academic research and technical exploration use only.
🙏 Acknowledgements
Significant thanks to the Unsloth AI team for making rapid fine-tuning of large LLMs accessible. We also thank the Qwen team and the open-source community developers producing exceptional distilled datasets.
📖 Citation
If you use this model in your research or projects, please cite:
@misc{jackrong_qwen35_opus_distilled,
title = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2},
author = {Jackrong},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2}}
}
Author: QuantTrio
Likes: 3
Downloads: 0
Tags: transformers, safetensors, qwen3_5, image-text-to-text, vLLM, AWQ, conversational, base_model:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2, base_model:quantized:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2, license:apache-2.0, endpoints_compatible, 4-bit, awq, region:us
stukenov/sozkz-core-omniaudio-70m-kk-asr-v1
Author: stukenov
Likes: 3
Downloads: 0
Tags: speech-recognition, asr, kazakh, audio, omniaudio, automatic-speech-recognition, kk, license:mit, model-index, region:us