Today's AI Summary

AI Developments: Qwen3-Next Model and Reinforcement Learning for Reasoning

Today's AI landscape features advancements in both model architecture and research methodologies. A notable new model release focuses on efficient scaling and long-context handling, while research explores the application of reinforcement learning to enhance reasoning in large language models.

Research Highlights

Several interesting research papers have emerged:

  • Reinforcement Learning for Large Reasoning Models: A survey paper (arXiv:2509.08827) examines the use of reinforcement learning (RL) to improve reasoning in large language models (LLMs), transforming them into large reasoning models (LRMs). It identifies challenges in computational resources, algorithm design, training data, and infrastructure, and explores strategies to enhance the scalability of RL for broader reasoning models.
  • Merge-of-Thought Distillation: This paper (arXiv:2509.08814) introduces a framework called Merge-of-Thought Distillation (MoT) to unify the reasoning abilities of multiple teachers into a student model. MoT alternates between teacher-specific supervised fine-tuning and weight-space merging. Results show that MoT surpasses strong models on competition math benchmarks.
  • AI Fact-Checking and Confidence Paradox: The paper "Scaling Truth: The Confidence Paradox in AI Fact-Checking" (arXiv:2509.08803) evaluates LLMs across multiple categories and languages, revealing that smaller models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence.

Model Spotlight: Qwen3-Next-80B-A3B-Instruct

Qwen3-Next-80B-A3B-Instruct, published by unsloth, stands out with 18 likes. This model is part of the Qwen3-Next series and introduces several key enhancements:

  • Hybrid Attention: Combines Gated DeltaNet and Gated Attention for efficient context modeling, enabling ultra-long context length.
  • High-Sparsity Mixture-of-Experts (MoE): Reduces FLOPs per token while preserving model capacity.
  • Stability Optimizations: Includes techniques like zero-centered and weight-decayed layernorm for robust training.
  • Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.

The model achieves strong performance in parameter efficiency and inference speed, outperforming Qwen3-32B-Base on downstream tasks with lower training costs and significantly higher inference throughput for contexts over 32K tokens. It also performs on par with Qwen3-235B-A22B-Instruct-2507 on certain benchmarks, with advantages in handling ultra-long-context tasks up to 256K tokens. The model natively supports context lengths of up to 262,144 tokens and can be extended up to 1,010,000 tokens using RoPE scaling techniques.
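As a rough sketch of how such context extension is typically configured: recent Qwen releases expose a rope_scaling entry in config.json for YaRN-style scaling. The field names and factor below follow that convention but are assumptions here, not taken from this model's card, so verify against the official documentation before relying on them.

```python
# Hypothetical YaRN-style rope_scaling entry, as it might appear in config.json.
# A 4x scaling factor applied to the native 262,144-token window:
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

# Upper bound on the extended window implied by this configuration.
extended = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
```

At factor 4.0 this yields a 1,048,576-token window, consistent with the roughly 1,010,000-token figure cited above (practical limits are usually a bit below the arithmetic maximum).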

Key Takeaways

  • Efficient Scaling: The Qwen3-Next model demonstrates advancements in scaling efficiency through innovative model architecture, particularly in handling ultra-long context lengths.
  • Reinforcement Learning for Reasoning: RL continues to be a key methodology for enhancing reasoning capabilities in LLMs, though challenges remain in scaling and resource requirements.
  • Model Confidence and Accuracy: The research on AI fact-checking highlights a potential "confidence paradox," where smaller models may exhibit high confidence despite lower accuracy, raising concerns about bias in information verification.
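The confidence-accuracy mismatch behind the "confidence paradox" is commonly quantified with expected calibration error (ECE). The sketch below is a generic illustration of that metric; the binning scheme and toy numbers are my own, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - mean confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Toy "overconfident small model": ~92% stated confidence, 50% actual accuracy.
stated = [0.95, 0.90, 0.92, 0.88, 0.97, 0.91]
hits = [1, 0, 0, 1, 0, 1]
ece = expected_calibration_error(stated, hits)
```

A well-calibrated model scores near zero; the large gap here (stated confidence far above actual accuracy) is the pattern the paper attributes to smaller models.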

AI Papers for 2026-03-31

Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning

Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, through interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.

Make Geometry Matter for Spatial Reasoning

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

Machine Learning Transferability for Malware Detection

Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing efforts to develop Machine Learning (ML) detection approaches, public datasets still lack feature compatibility, which limits generalization under distribution shifts as well as transferability across datasets. This study evaluates the suitability of different data preprocessing approaches for detecting Portable Executable (PE) files with ML models. The preprocessing pipeline unifies datasets under the 2,381-dimensional EMBERv2 feature representation and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. For evaluation, both the EMBER + BODMAS and EMBER + BODMAS + ERMDS models are tested against TRITIUM, INFERNO, and SOREL-20M; ERMDS is additionally used as a test set for the EMBER + BODMAS setup.

Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers), which limits their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates directly, Sig2GPS is reframed as an image-to-video generation task that operates directly in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next-GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality. Unlike theoretical studies, we captured granular power metrics across eight models ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. We harness this pipeline to conduct an empirical case study on a flagship Android device, the Samsung Galaxy S25 Ultra, establishing foundational hypotheses regarding the trade-offs between generation quality, performance, and resource consumption. Our investigation uncovered a counter-intuitive quantization-energy paradox. While modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.

Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
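The ODE-to-SDE conversion mentioned above matches the standard correspondence between a probability-flow ODE and a stochastic process with the same marginals; a sketch in generic notation (the paper's exact formulation may differ):

```latex
% Rectified-flow ODE with learned velocity field v_\theta:
\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t
% Equivalent SDE sharing the same marginals p_t, for any noise scale \sigma_t:
\mathrm{d}x_t = \Big[ v_\theta(x_t, t)
  + \tfrac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t) \Big]\,\mathrm{d}t
  + \sigma_t\,\mathrm{d}W_t
```

The score-correction term cancels the extra diffusion in the Fokker-Planck equation, so the per-frame marginals are unchanged while each step gains a stochastic injection point that a bitstream-driven codebook can control.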

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.

AI Models

microsoft/harrier-oss-v1-0.6b


tags: mteb, sentence-transformers, transformers

language: multilingual (af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh)

license: mit

harrier-oss-v1

harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.

| Model               | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|---------------------|------------|---------------------|------------|---------------|
| harrier-oss-v1-270m | 270M       | 640                 | 32,768     | 66.5          |
| harrier-oss-v1-0.6b | 0.6B       | 1,024               | 32,768     | 69.0          |
| harrier-oss-v1-27b  | 27B        | 5,376               | 32,768     | 74.3          |

Training

All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks. The 270m and 0.6b variants are additionally trained with knowledge distillation from larger embedding models.
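Contrastive objectives for embedding models typically take an InfoNCE-like form over in-batch negatives. The sketch below illustrates that general pattern; the temperature value and batch construction are illustrative assumptions, not Microsoft's actual training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: the i-th document is the positive for the
    i-th query; every other document in the batch acts as a negative."""
    q = F.normalize(query_emb, p=2, dim=1)
    d = F.normalize(doc_emb, p=2, dim=1)
    logits = (q @ d.T) / temperature      # (B, B) cosine-similarity logits
    targets = torch.arange(q.shape[0])    # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Random stand-ins for encoder outputs (batch of 8, dimension 1,024).
loss = info_nce_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```

Minimizing this loss pulls each query toward its paired document and pushes it away from the rest of the batch, which is what makes the cosine-similarity scoring in the usage examples below meaningful.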

Usage

Below is an example of encoding queries and passages from the MS-MARCO passage ranking dataset.

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("microsoft/harrier-oss-v1-0.6b", model_kwargs={"dtype": "auto"})

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").

Transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-0.6b')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-0.6b', dtype='auto')
model.eval()
model.cuda()

max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

Supported Languages

The models are trained on multilingual data and support a wide range of languages, including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.

Evaluation

Please follow the instructions in the mteb repository to reproduce our scores. The evaluation prompts used for each task are also available in mteb_v2_eval_prompts.json.

FAQ

1. Do I need to add instructions to the query?

Yes. The model is trained with a query-side instruction, and omitting it will degrade performance. The task definition should be a one-sentence instruction that describes the task; this provides a way to customize text embeddings for different scenarios through natural-language instructions.

On the other hand, there is no need to add instructions to the document side.

2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.

3. What pooling strategy does this model use?

The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation. The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.

Author: microsoft

Likes: 75

Downloads: 0

Tags: sentence-transformers, safetensors, qwen3, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us

microsoft/harrier-oss-v1-270m


tags: mteb, sentence-transformers, transformers

language: multilingual (af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh)

license: mit

harrier-oss-v1

harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.

| Model               | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|---------------------|------------|---------------------|------------|---------------|
| harrier-oss-v1-270m | 270M       | 640                 | 32,768     | 66.5          |
| harrier-oss-v1-0.6b | 0.6B       | 1,024               | 32,768     | 69.0          |
| harrier-oss-v1-27b  | 27B        | 5,376               | 32,768     | 74.3          |

Training

All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks. The 270m and 0.6b variants are additionally trained with knowledge distillation from larger embedding models.

Usage

Below is an example of encoding queries and passages from the MS-MARCO passage ranking dataset.

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("microsoft/harrier-oss-v1-270m", model_kwargs={"dtype": "auto"})

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").

Transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-270m')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-270m', dtype='auto')
model.eval()
model.cuda()

max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

Supported Languages

The models are trained on multilingual data and support a wide range of languages, including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.

Evaluation

Please follow the instructions in the mteb repository to reproduce our scores. The evaluation prompts used for each task are also available in mteb_v2_eval_prompts.json.

FAQ

1. Do I need to add instructions to the query?

Yes. This is how the model was trained; omitting the instruction will cause a performance degradation. The task definition should be a one-sentence instruction that describes the task. This is a way to customize text embeddings for different scenarios through natural language instructions.

On the other hand, there is no need to add instructions to the document side.

2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.

3. What pooling strategy does this model use?

The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation. The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.

Author: microsoft

Likes: 43

Downloads: 0

Tags: sentence-transformers, safetensors, gemma3_text, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us

microsoft/harrier-oss-v1-27b


tags:

  • mteb
  • sentence-transformers
  • transformers

language:

  • multilingual
  • af
  • am
  • ar
  • as
  • az
  • be
  • bg
  • bn
  • br
  • bs
  • ca
  • cs
  • cy
  • da
  • de
  • el
  • en
  • eo
  • es
  • et
  • eu
  • fa
  • fi
  • fr
  • fy
  • ga
  • gd
  • gl
  • gu
  • ha
  • he
  • hi
  • hr
  • hu
  • hy
  • id
  • is
  • it
  • ja
  • jv
  • ka
  • kk
  • km
  • kn
  • ko
  • ku
  • ky
  • la
  • lo
  • lt
  • lv
  • mg
  • mk
  • ml
  • mn
  • mr
  • ms
  • my
  • ne
  • nl
  • no
  • om
  • or
  • pa
  • pl
  • ps
  • pt
  • ro
  • ru
  • sa
  • sd
  • si
  • sk
  • sl
  • so
  • sq
  • sr
  • su
  • sv
  • sw
  • ta
  • te
  • th
  • tl
  • tr
  • ug
  • uk
  • ur
  • uz
  • vi
  • xh
  • yi
  • zh

license: mit

harrier-oss-v1

harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use decoder-only architectures with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. The models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark as of the release date.

| Model | Parameters | Embedding Dimension | Max Tokens | MTEB v2 Score |
|---|---|---|---|---|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |

Training

All models are trained with contrastive learning objectives on a large-scale mixture of multilingual datasets covering diverse tasks. The 270m and 0.6b variants are additionally trained with knowledge distillation from larger embedding models.

Usage

Below is an example to encode queries and passages from the MS-MARCO passage ranking dataset.

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("microsoft/harrier-oss-v1-27b", model_kwargs={"dtype": "auto"})

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and bitext_query. You can also use a custom instruction directly via e.g. model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ").

Transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-27b')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-27b', dtype='auto')
model.eval()
model.cuda()

max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

Supported Languages

The models are trained on multilingual data and support a wide range of languages, including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.

Evaluation

Please follow the instructions in the mteb repository to reproduce our scores. The evaluation prompts used for each task are also available in mteb_v2_eval_prompts.json.

FAQ

1. Do I need to add instructions to the query?

Yes. This is how the model was trained; omitting the instruction will cause a performance degradation. The task definition should be a one-sentence instruction that describes the task. This is a way to customize text embeddings for different scenarios through natural language instructions.

On the other hand, there is no need to add instructions to the document side.

2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and PyTorch can cause negligible but non-zero performance differences.

3. What pooling strategy does this model use?

The model uses last-token pooling — the embedding of the last non-padding token is used as the sentence representation. The embedding is then L2-normalized. This is handled automatically when using Sentence Transformers.

Author: microsoft

Likes: 40

Downloads: 0

Tags: sentence-transformers, safetensors, gemma3_text, feature-extraction, mteb, transformers, multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh, license:mit, text-embeddings-inference, endpoints_compatible, region:us

meituan-longcat/LongCat-AudioDiT-3.5B


license: mit

language:

  • zh
  • en

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

Paper: https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B | https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B

Introduction

LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that directly operates on the waveform latent space.

Abstract: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.

(Figure: LongCat-AudioDiT architecture diagram, architecture.png in the repository.)

This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.

Experimental Results on Seed Benchmark

LongCat-AudioDiT achieves state-of-the-art (SOTA) voice-cloning performance on the Seed benchmark, surpassing both closed-source and open-source models.

| Model | ZH CER (%) ↓ | ZH SIM ↑ | EN WER (%) ↓ | EN SIM ↑ | ZH-Hard CER (%) ↓ | ZH-Hard SIM ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | 0.790 | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | 5.83 | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | 0.93 | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | 1.23 | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | 0.87 | 0.797 | 1.57 | 0.738 | 5.71 | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | 0.812 | 1.78 | 0.762 | 6.33 | 0.787 |
| LongCat-AudioDiT-3.5B | 1.09 | 0.818 | 1.50 | 0.786 | 6.04 | 0.797 |

Installation

pip install -r requirements.txt

CLI Inference

# TTS
python inference.py --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" --output_audio output.wav --model_dir meituan-longcat/LongCat-AudioDiT-1B

# Voice cloning
python inference.py \
    --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
    --prompt_text "小偷却一点也不气馁,继续在抽屉里翻找。" \
    --prompt_audio assets/prompt.wav \
    --output_audio output.wav \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
    --lst /path/to/meta.lst \
    --output_dir /path/to/output \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

Inference (Python API)

1. TTS

import audiodit  # auto-registers with transformers
from audiodit import AudioDiTModel
from transformers import AutoTokenizer
import torch, soundfile as sf

# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half()  # VAE runs in fp16 (matching original)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

# Zero-shot synthesis
inputs = tokenizer(["今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"], padding="longest", return_tensors="pt")
output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    duration=62,  # latent frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="cfg",  # or "apg"
    seed=1024,
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)

2. Voice Cloning (with prompt audio)

import librosa, torch

# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (1, 1, T)

# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁,继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")

output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_audio=prompt_wav,
    duration=138,  # prompt_frames + gen_frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="apg",
    seed=1024,
)
sf.write("cloned.wav", output.waveform.squeeze().cpu().numpy(), 24000)  # save the cloned speech

License Agreement

This repository, including both the model weights and the source code, is released under the MIT License.

Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the LICENSE file.

Author: meituan-longcat

Likes: 21

Downloads: 0

Tags: safetensors, audiodit, zh, en, license:mit, region:us

meituan-longcat/LongCat-AudioDiT-1B


license: mit

language:

  • zh
  • en

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

Paper: https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B | https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B

Introduction

LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that directly operates on the waveform latent space.

Abstract: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.

(Figure: LongCat-AudioDiT architecture diagram, architecture.png in the repository.)

This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.

Experimental Results on Seed Benchmark

LongCat-AudioDiT achieves state-of-the-art (SOTA) voice-cloning performance on the Seed benchmark, surpassing both closed-source and open-source models.

| Model | ZH CER (%) ↓ | ZH SIM ↑ | EN WER (%) ↓ | EN SIM ↑ | ZH-Hard CER (%) ↓ | ZH-Hard SIM ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | 0.790 | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | 5.83 | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | 0.93 | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | 1.23 | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | 0.87 | 0.797 | 1.57 | 0.738 | 5.71 | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | 0.812 | 1.78 | 0.762 | 6.33 | 0.787 |
| LongCat-AudioDiT-3.5B | 1.09 | 0.818 | 1.50 | 0.786 | 6.04 | 0.797 |

Installation

pip install -r requirements.txt

CLI Inference

# TTS
python inference.py --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" --output_audio output.wav --model_dir meituan-longcat/LongCat-AudioDiT-1B

# Voice cloning
python inference.py \
    --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
    --prompt_text "小偷却一点也不气馁,继续在抽屉里翻找。" \
    --prompt_audio assets/prompt.wav \
    --output_audio output.wav \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
    --lst /path/to/meta.lst \
    --output_dir /path/to/output \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

Inference (Python API)

1. TTS

import audiodit  # auto-registers with transformers
from audiodit import AudioDiTModel
from transformers import AutoTokenizer
import torch, soundfile as sf

# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half()  # VAE runs in fp16 (matching original)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

# Zero-shot synthesis
inputs = tokenizer(["今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"], padding="longest", return_tensors="pt")
output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    duration=62,  # latent frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="cfg",  # or "apg"
    seed=1024,
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)

2. Voice Cloning (with prompt audio)

import librosa, torch

# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (1, 1, T)

# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁,继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")

output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_audio=prompt_wav,
    duration=138,  # prompt_frames + gen_frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="apg",
    seed=1024,
)
sf.write("cloned.wav", output.waveform.squeeze().cpu().numpy(), 24000)  # save the cloned speech

License Agreement

This repository, including both the model weights and the source code, is released under the MIT License.

Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the LICENSE file.

Author: meituan-longcat

Likes: 18

Downloads: 0

Tags: safetensors, audiodit, zh, en, license:mit, region:us

bosonai/higgs-audio-v3-8b-stt


license: apache-2.0

language:

  • en

tags:

  • automatic-speech-recognition
  • whisper
  • qwen

pipeline_tag: automatic-speech-recognition

Higgs Audio v3 8B STT

A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.

Usage

Important: This model uses a custom architecture. You must pass trust_remote_code=True when loading.

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "bosonai/higgs-audio-v3-8b-stt",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt")

Requirements

torch
transformers>=4.51.0
boson_multimodal  # for audio preprocessing

Architecture

  • Encoder: Whisper-Large-v3 (frozen)
  • Decoder: Qwen3-8B (LoRA fine-tuned, merged)
  • Total parameters: 8.91B
  • Audio input: 16kHz mono WAV
  • Supports: Thinking mode for improved accuracy

Performance (ESB Benchmark — Full Scale)

| Dataset | WER |
|---------|-----|
| AMI | 9.85% |
| Earnings22 | 9.01% |
| GigaSpeech | 8.54% |
| LibriSpeech Clean | 1.28% |
| LibriSpeech Other | 2.41% |
| SPGISpeech | 3.53% |
| TED-LIUM | 2.74% |
| VoxPopuli | 6.07% |
| Average | 5.43% |
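
The reported average can be sanity-checked as the unweighted mean of the per-dataset scores above (an assumption on our part: the card does not say whether the average is weighted by dataset size):

```python
# Unweighted mean of the per-dataset WERs reported above (assumption:
# the card's "Average" row is a simple mean, not weighted by dataset size).
wer = {
    "AMI": 9.85, "Earnings22": 9.01, "GigaSpeech": 8.54,
    "LibriSpeech Clean": 1.28, "LibriSpeech Other": 2.41,
    "SPGISpeech": 3.53, "TED-LIUM": 2.74, "VoxPopuli": 6.07,
}
average = sum(wer.values()) / len(wer)
print(f"{average:.2f}%")  # 5.43%
```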

Author: bosonai

Likes: 4

Downloads: 0

Tags: safetensors, higgs_audio_3, automatic-speech-recognition, whisper, qwen, custom_code, en, license:apache-2.0, region:us

SciMaker/T-qwen3.5-4B

A retrained qwen3.5 model that fixes two issues: first, when conversing with Traditional Chinese prompts, it almost never produces Simplified Chinese; second, when answering questions about Taiwan, its perspective is now normal, centered on the world's common understanding rather than on China's viewpoint. The series is named T-qwen, where the T stands for Taiwan and Traditional Chinese.

Author: SciMaker

Likes: 4

Downloads: 0

Tags: gguf, endpoints_compatible, region:us, conversational

YTan2000/Qwen3.5-27B-TQ3_1S


license: mit

language:

  • en

library_name: gguf
pipeline_tag: text-generation

tags:

  • gguf
  • llama.cpp
  • qwen
  • qwen3.5
  • quantization
  • turboquant
  • wht

base_model:

  • Qwen/Qwen3.5-27B

Qwen3.5-27B-TQ3_1S

Qwen3.5-27B-TQ3_1S is a GGUF quantization of Qwen/Qwen3.5-27B using TQ3_1S, a 3.5-bit weight format based on:

  • Walsh-Hadamard rotation
  • 8 centroid quantization
  • dual half-block scales

This release is aimed at one practical outcome:

  • near-Q4_0 quality
  • about 10% smaller than Q4_0
  • small enough to fit fully on a single 16 GB RTX 5060 Ti in the tested llama.cpp setup

Headline Result

Perplexity measured on the gold-standard wiki.test.raw pass (context size c=512, full 580 chunks):

| Format | PPL | Size |
|---|---:|---:|
| Q4_0 | 7.2431 +/- 0.0482 | 14.4 GB |
| TQ3_1S | 7.2570 +/- 0.0480 | 12.9 GB |

Measured gap:

  • +0.0139 PPL
  • about 0.19%
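
The gap figures follow directly from the table's perplexity values:

```python
# Reproduce the measured gap from the perplexity values in the table.
q4_ppl = 7.2431   # Q4_0
tq3_ppl = 7.2570  # TQ3_1S
gap = tq3_ppl - q4_ppl
relative = gap / q4_ppl * 100
print(f"+{gap:.4f} PPL ({relative:.2f}%)")  # +0.0139 PPL (0.19%)
```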

Safe interpretation:

  • TQ3_1S is near-Q4_0 quality
  • TQ3_1S is about 1.5 GB smaller
  • on this 27B model, that size reduction is enough to change deployment on a 16 GB GPU

Important Caveat

This model card does not claim that TQ3_1S is universally faster than native Q4_0 under the same conditions.

The practical speed win in the tested setup comes mainly from:

  • TQ3_1S fitting fully on GPU
  • while Q4_0 does not fit fully on GPU on the same 16 GB card

So this is primarily a deployment / fit advantage story, not a blanket kernel-speed claim.

Files

  • Qwen3.5-27B-TQ3_1S.gguf

Base Model

  • Qwen/Qwen3.5-27B

Recommended Runtime

This model is intended for the public TQ3 runtime fork:

  • GitHub: https://github.com/turbo-tan/llama.cpp-tq3

It requires TQ3_1S runtime support and will not run on a stock llama.cpp build unless that support is present.

Example

./build/bin/llama-server \
  -m Qwen3.5-27B-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -c 4096

Quantization Notes

TQ3_1S uses a 32-element block layout:

[d0: fp16][d1: fp16][qs: 12 bytes]

That is:

  • 16 bytes per 32 weights
  • 4.0 bits per weight at the block level
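To make the size comparison concrete, here is the bits-per-weight arithmetic for the layout above next to llama.cpp's Q4_0 block (one fp16 scale plus 16 bytes of 4-bit quants per 32 weights):

```python
# Bits-per-weight for the TQ3_1S block layout described above, compared
# against llama.cpp's Q4_0 block (18 bytes per 32 weights).
BLOCK_WEIGHTS = 32

tq3_1s_bytes = 2 + 2 + 12   # d0 (fp16) + d1 (fp16) + 12 bytes of quants
q4_0_bytes = 2 + 16         # fp16 scale + 16 bytes of 4-bit quants

tq3_1s_bpw = tq3_1s_bytes * 8 / BLOCK_WEIGHTS  # 4.0 bits per weight
q4_0_bpw = q4_0_bytes * 8 / BLOCK_WEIGHTS      # 4.5 bits per weight

saving = 1 - tq3_1s_bytes / q4_0_bytes         # ~11% smaller per block
```

The ~11% per-block saving is consistent with the "about 10% smaller than Q4_0" figure and with the 14.4 GB vs 12.9 GB file sizes above.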

Credit

This work is inspired by the broader line of transform-based quantization methods, especially RaBitQ-style Walsh-Hadamard rotation ideas, adapted here for LLM weight quantization in GGUF / llama.cpp.
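For readers unfamiliar with the transform, a minimal unnormalized fast Walsh-Hadamard transform (power-of-two length; a generic sketch, not the repository's actual kernel) looks like this:

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform.

    Requires len(vec) to be a power of two. Applying it twice returns
    the input scaled by len(vec), since the Hadamard matrix satisfies
    H @ H = n * I.
    """
    x = list(vec)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return x

print(fwht([1.0, 1.0, 1.0, 1.0]))  # [4.0, 0.0, 0.0, 0.0]
```

In RaBitQ-style schemes the rotation spreads outlier weights across a block before quantization; the inverse rotation is simply `fwht` applied again, divided by the block length.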

Limitations

  • This 27B result does not imply that plain TQ3_1S is equally strong on smaller dense models.
  • In internal testing, 9B models were much less forgiving at this bitrate.
  • This release is a practical 27B deployment artifact, not a universal claim about all model scales.

License

Same model license terms as the base model apply.

Author: YTan2000

Likes: 3

Downloads: 0

Tags: gguf, llama.cpp, qwen, qwen3.5, quantization, turboquant, wht, text-generation, en, base_model:Qwen/Qwen3.5-27B, base_model:quantized:Qwen/Qwen3.5-27B, license:mit, endpoints_compatible, region:us, imatrix, conversational

QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ


library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
  - vLLM
  - AWQ
base_model:
  - Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
base_model_relation: quantized

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ

Base model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

This repo quantizes the model using data-free quantization (no calibration dataset required).

【Dependencies / Installation】

vllm>=0.18.0
transformers>=5.3.0.dev0

As of 2026-03-30, make sure your system has CUDA 12.8 installed.

Then, create a fresh Python environment (e.g. python3.12 venv) and run:

pip install vllm==0.18.0

# upgrade transformers so that applications can properly execute tool calls
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# locate modeling_rope_utils.py line 651 to fix a simple bug
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE='            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"

vLLM Official Guide

【vLLM Startup Command】

export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/QuantTrio/Qwen3.5-27B-AWQ \
    --served-model-name MY_MODEL \
    --max-num-seqs 32 \
    --max-model-len 32768  \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

【Logs】

2026-03-30
1. Initial commit

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| 21GiB | 2026-03-30 |

【Model Download】

from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ', cache_dir="your_local_path")

【Overview】

🌟 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

📢 Announcement

v2 Update:

  • Accuracy preserved: Matches base model on HumanEval (96.91% pass@1)

  • Shorter reasoning: ~24% reduction in chain-of-thought length

  • Higher efficiency: +31.6% more correct solutions per token

  • ⚠️ Trade-off: −1.24% on HumanEval+ and −7.2% on MMLU-Pro (indicating reduced general-knowledge reasoning performance)

⚠️Note: Due to the scope of SFT data and training focus, the model may underperform the base model on certain tasks requiring long-context understanding or more complex multi-step reasoning. The efficiency and accuracy results reported here are based solely on the HumanEval and HumanEval+ benchmarks. Thank you for your understanding.
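The two efficiency bullets above are mutually consistent: holding the number of correct solutions fixed while shortening chains by ~24% yields roughly a +31.6% gain in correct solutions per token. A sketch with hypothetical token counts (only the ~24% reduction comes from the card; the absolute numbers are made up):

```python
# Hypothetical illustration of the "correct solutions per token" metric.
# The token counts below are invented; only the ~24% reduction is from the card.
def solutions_per_token(num_correct, total_tokens):
    return num_correct / total_tokens

base = solutions_per_token(158, 1_000_000)  # hypothetical baseline totals
v2 = solutions_per_token(158, 760_000)      # same accuracy, ~24% fewer tokens

improvement = v2 / base - 1
print(f"{improvement:+.1%}")  # +31.6%
```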

💡 Model Introduction

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 is the second iteration of this reasoning-focused Qwen3.5-27B fine-tune, built to improve the efficiency of chain-of-thought generation, delivering substantial gains in reasoning speed and cost reduction while also increasing absolute accuracy.

Compared with the earlier version, v2 was trained with 14,000 Claude 4.6 Opus-style general reasoning samples, with a stronger emphasis on transferring concise, reusable reasoning patterns rather than only maximizing raw benchmark scores. The goal of v2 is not simply to make the model "think more," but to help it think more economically: reducing unnecessarily long internal chains, avoiding verbose over-analysis on easy problems, and massively improving the reasoning-cost-to-quality ratio while beating the baseline's benchmark correctness.

A key design choice in v2 is that the distillation data is primarily general-domain reasoning data—specifically focused on mathematics, word problems, logical deduction, and a balanced mix of general knowledge and instructions—rather than specialized code-heavy supervision. Consequently, HumanEval and HumanEval+ are employed here to evaluate cross-task generalization and capability transfer, rather than serving as direct optimization targets. High performance on these benchmarks, despite the lack of code-centric training, confirms that the model's reasoning scaffold has become more robust and transferable, proving that fundamental reasoning logic can effectively power specialized tasks like programming.

HumanEval Benchmark Analysis 🪐

The raw evaluation outputs for both models were independently cleaned, verified, and aggregated using GPT-5.4-Pro-Thinking. The final comparative results are based on these standardized and curated outputs. To ensure reliability, all results were further cross-checked and consolidated through two rounds of independent validation using Claude-4.6-Opus-Thinking.

All evaluations were conducted in an inference environment based on Unsloth + vLLM (BF16) to ensure consistent and efficient execution conditions.

[HumanEval benchmark comparison screenshots]

🗺️ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Qwen3.5-27B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
 │
 ▼
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
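The response-only training step in the pipeline above can be sketched as label masking: every token up to the assistant marker is set to the ignore index so the loss is computed only on the assistant's response. A minimal, tokenizer-agnostic sketch (the marker string comes from the card; the helper below is illustrative, not the actual training code):

```python
# Minimal sketch of response-only SFT label masking. Everything before
# the assistant response (prompt, system turn, "<|im_start|>assistant\n<think>"
# marker) is masked to -100 so it contributes no loss.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking all positions before response_start."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# toy example: token positions 0-4 are the prompt, 5-7 are the response
labels = mask_prompt_labels([11, 12, 13, 14, 15, 21, 22, 23], response_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```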

🧠 Example of Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern:
“Let me analyze this request carefully: 1..2..3...”.
This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
            .
            .
            .

📚 All Datasets Used

The dataset consists of high-quality, filtered reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Provides comprehensive Claude 4.6 Opus reasoning trajectories. |
| Roman1111111/claude-opus-4.6-10000x | Large-scale public Claude 4.6 Opus distillation data used to strengthen general reasoning transfer in v2. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injects high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples designed to strengthen structured step-by-step problem solving and improve reasoning diversity. |

⚠️ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; facts asserted during the thinking sequence may occasionally be hallucinated, especially when verifying real-world events.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • This model is a test version intended solely for learning and demonstration purposes, and is for academic research and technical exploration use only.

🙏 Acknowledgements

Significant thanks to the Unsloth AI team for making rapid fine-tuning of large LLMs accessible. We also acknowledge the Qwen team and the open-source community developers producing exceptional distilled datasets.

📖 Citation

If you use this model in your research or projects, please cite:

@misc{jackrong_qwen35_opus_distilled,
  title        = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2}}
}

Author: QuantTrio

Likes: 3

Downloads: 0

Tags: transformers, safetensors, qwen3_5, image-text-to-text, vLLM, AWQ, conversational, base_model:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2, base_model:quantized:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2, license:apache-2.0, endpoints_compatible, 4-bit, awq, region:us

stukenov/sozkz-core-omniaudio-70m-kk-asr-v1

Author: stukenov

Likes: 3

Downloads: 0

Tags: speech-recognition, asr, kazakh, audio, omniaudio, automatic-speech-recognition, kk, license:mit, model-index, region:us