Today's AI Summary

AI Developments: New OCR Model Excels in Benchmarks, Enterprise Agents Emerge, and VLMs Face Scrutiny

This week's AI landscape is marked by advancements in OCR technology, the rise of enterprise-focused agent systems, and critical evaluations of Vision-Language Models (VLMs).

Research Highlights

Several interesting research papers have been published:

  • Enterprise Deep Research (EDR): A multi-agent system designed for enterprise analytics, EDR integrates planning, specialized search agents, and a tool ecosystem to automate report generation and provide real-time insights. The system outperforms state-of-the-art agentic systems on open-ended benchmarks.
  • Executable Knowledge Graphs (xKG): This paper introduces xKG, a knowledge base that integrates technical insights and code snippets from scientific literature to improve the ability of LLM agents to replicate AI research. xKG demonstrates substantial performance gains when integrated into agent frameworks.
  • Foundational Automatic Reasoning Evaluators (FARE): This work focuses on scaling data to train evaluators for reasoning-centric domains. The resulting FARE models challenge larger specialized evaluators and set a new standard for open-source evaluators.
  • Unbiased Gradient Low-Rank Projection: This paper addresses the lack of convergence guarantees in gradient low-rank projection methods. It introduces GaLore Unbiased with Muon (GUM), a novel optimization method that matches the convergence guarantees of the base Muon algorithm while preserving memory efficiency.
  • Seeing but Not Believing: This paper investigates failures in VLMs, finding that they often perceive visual evidence even when providing incorrect answers. An inference-time intervention that highlights evidence regions improves accuracy across multiple VLM families.

Model Spotlight

  • Chandra: Datalab (datalab-to) has released Chandra, an OCR model that outputs markdown, HTML, and JSON with high accuracy. Chandra excels at extracting text from images and PDFs while preserving layout information, and it outperforms other models in several categories of the olmocr benchmark. With 13 likes, the model is gaining traction for its practical applications.
  • Sa2VA-Qwen3-VL-4B: ByteDance has released Sa2VA-Qwen3-VL-4B, an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves comparable performance to SOTA MLLMs Qwen2.5-VL and InternVL3 on question-answering benchmarks.

Key Takeaways

  • OCR Advancements: Chandra represents a significant step forward in OCR technology, offering improved accuracy and layout preservation for document conversion.
  • Enterprise AI Agents: EDR highlights the growing importance of multi-agent systems for enterprise analytics, enabling automated report generation and real-time insights.
  • VLM Evaluation: The "Seeing but Not Believing" paper underscores the need for careful evaluation of VLMs, revealing potential disconnects between visual attention and answer correctness.
  • Memory-Efficient Optimization: The GUM method addresses the challenge of memory-efficient optimization for large language models, offering a promising approach for unbiased low-rank optimization.
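For context, the low-rank projection that GaLore-style methods apply to gradients can be sketched in a few lines. This is an illustration of the general idea only, not the GUM algorithm; the function name and the rank-1 example are made up for demonstration.

```python
import numpy as np

def project_gradient(grad, r):
    # GaLore-style sketch: project the gradient onto its top-r left singular
    # directions so the optimizer state only needs the (r x n) compressed form.
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :r]                 # projection basis, shape (m, r)
    return P, P.T @ grad         # compressed gradient, shape (r, n)

# A rank-1 gradient is captured exactly by a rank-1 projection
g = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
P, low_rank = project_gradient(g, 1)
reconstructed = P @ low_rank
```

GUM's contribution, per the paper, is making updates built on such projections unbiased while keeping this memory saving.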

AI Papers for 2026-03-07

RoboPocket: Improve Robot Policies Instantly with Your Phone

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with only a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
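The spectrum-preserving property that motivates POET is easy to verify numerically: multiplying a weight matrix by orthogonal matrices on both sides leaves its singular values unchanged. This is a toy check of that property, not the POET-X implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

# Random orthogonal factors via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # left transform
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # right transform
W_transformed = R @ W @ Q

# The singular values (the "spectrum") are invariant under R @ W @ Q
original_spectrum = np.linalg.svd(W, compute_uv=False)
transformed_spectrum = np.linalg.svd(W_transformed, compute_uv=False)
```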

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
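The sink behavior is easy to picture with a toy softmax: a single token whose attention logit dwarfs the others absorbs most of the attention mass. This is an illustration of the phenomenon, not an experiment from the paper.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Token 0 plays the "sink": its logit is far larger than the rest,
# so it attracts nearly all attention mass regardless of the other tokens.
logits = np.array([6.0, 0.1, 0.2, 0.0])
attention = softmax(logits)
```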

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor can detect, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods, and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

RealWonder: Real-Time Physical Action-Conditioned Video Generation

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/

AI Models

mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF


base_model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
language:
  - en
  - zh
library_name: transformers
license: apache-2.0
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:
  - unsloth
  - qwen
  - qwen3.5
  - reasoning
  - chain-of-thought
  - Dense

About


weighted/imatrix quants of https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

This is a vision model - mmproj files (if any) will be in the static repository.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 6.3 | for the desperate |
| GGUF | i1-IQ1_M | 6.9 | mostly desperate |
| GGUF | i1-IQ2_XXS | 7.8 | |
| GGUF | i1-IQ2_XS | 8.5 | |
| GGUF | i1-IQ2_S | 8.8 | |
| GGUF | i1-IQ2_M | 9.5 | |
| GGUF | i1-Q2_K_S | 9.8 | very low quality |
| GGUF | i1-Q2_K | 10.2 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 10.8 | lower quality |
| GGUF | i1-IQ3_XS | 11.7 | |
| GGUF | i1-Q3_K_S | 12.2 | IQ3_XS probably better |
| GGUF | i1-IQ3_S | 12.2 | beats Q3_K* |
| GGUF | i1-IQ3_M | 12.7 | |
| GGUF | i1-Q3_K_M | 13.4 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 14.1 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 14.8 | |
| GGUF | i1-Q4_0 | 15.6 | fast, low quality |
| GGUF | i1-Q4_K_S | 15.7 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 16.6 | fast, recommended |
| GGUF | i1-Q4_1 | 17.2 | |
| GGUF | i1-Q5_K_S | 18.8 | |
| GGUF | i1-Q5_K_M | 19.5 | |
| GGUF | i1-Q6_K | 22.2 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

(graph image not reproduced here)

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 3

Downloads: 0

Tags: transformers, gguf, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, Dense, en, zh, base_model:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, base_model:quantized:Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

bowang0911/pplx-embed-v1-0.6b-gguf


library_name: llama-cpp
base_model: perplexity-ai/pplx-embed-v1-0.6b
tags:
  - embedding
  - gguf
  - qwen3
  - bidirectional
license: apache-2.0

pplx-embed-v1-0.6b GGUF (F16)

GGUF conversion of perplexity-ai/pplx-embed-v1-0.6b.

| File | Quant | Size |
|---|---|---|
| pplx-embed-v1-0.6b-f16.gguf | F16 | 1.2 GB |

Usage

This model requires non-causal (bidirectional) attention. Without it, outputs will be incorrect.

from llama_cpp import Llama, llama_cpp

# pooling_type=1 selects mean pooling over token embeddings
llm = Llama(model_path="pplx-embed-v1-0.6b-f16.gguf", embedding=True, pooling_type=1)
# switch the context to non-causal (bidirectional) attention, as this model requires
llama_cpp.llama_set_causal_attn(llm._ctx.ctx, False)

raw = llm.embed("your text here")

Note: The GGUF outputs raw float embeddings only. The original model natively produces int8/binary quantized embeddings via a post-processing step (st_quantize.FlexibleQuantizer). To match that behavior, apply the quantization manually:

import numpy as np

# Int8: tanh → scale → round → clamp (matches Int8TanhQuantizer)
int8_emb = np.clip(np.round(np.tanh(raw) * 127), -128, 127).astype(np.int8)

# Binary: sign (matches BinaryTanhQuantizer)
binary_emb = np.where(np.array(raw) >= 0, 1, -1).astype(np.int8)

CLI:

llama-embedding -m pplx-embed-v1-0.6b-f16.gguf --attention non-causal --pooling mean -p "your text here"

Verification

Cosine similarity vs original: > 0.99999 across all test inputs. Residual diffs are from F32 → F16 precision.
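That verification is straightforward to reproduce with a small cosine similarity helper. The embedding values below are synthetic stand-ins, not real model outputs; the F16 round-trip mimics the precision loss described above.

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# F32 -> F16 rounding leaves the embedding direction almost unchanged
f32 = np.array([0.1234567, -0.7654321, 0.3333333], dtype=np.float32)
f16_roundtrip = f32.astype(np.float16).astype(np.float32)
sim = cosine_similarity(f32, f16_roundtrip)
```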

Author: bowang0911

Likes: 3

Downloads: 0

Tags: llama-cpp, gguf, embedding, qwen3, bidirectional, base_model:perplexity-ai/pplx-embed-v1-0.6b, base_model:quantized:perplexity-ai/pplx-embed-v1-0.6b, license:apache-2.0, endpoints_compatible, region:us

cruciverb-it/crossword-space-mpnet-base-ade


library_name: transformers
license: cc-by-4.0
language:
  - it
tags:
  - crossword
  - dual-encoder
  - contrastive-learning
  - information-retrieval
  - sentence-similarity
  - italian
datasets:
  - cruciverb-it/evalita2026
base_model:
  - sentence-transformers/paraphrase-multilingual-mpnet-base-v2
pipeline_tag: feature-extraction
model-index:
  - name: crossword-space-mpnet-base-ade
    results:
      - task:
          type: retrieval
          name: Crossword Clue Answering
        dataset:
          type: cruciverb-it/evalita2026
          name: Crossword Test
        metrics:
          - type: accuracy
            value: 54.0
            name: Acc@1 (length-filtered)
          - type: mrr
            value: 63.6
            name: MRR (length-filtered)
      - task:
          type: retrieval
          name: Dictionary Definition Matching
        dataset:
          type: cruciverb-it/evalita2026
          name: Dictionary Test
        metrics:
          - type: accuracy
            value: 35.6
            name: Acc@1 (length-filtered)
          - type: mrr
            value: 45.0
            name: MRR (length-filtered)

Crossword Space: MPNet-base ADE

An Asymmetric Dual Encoder (ADE) for Italian crossword clue answering, trained with contrastive learning on clue-answer pairs.

This model projects crossword clues and candidate answers into a shared 768-dimensional latent space, enabling efficient retrieval via FAISS inner-product search.
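The retrieval step can be sketched with a plain NumPy inner-product search standing in for FAISS. The embeddings below are random placeholders, not real model outputs; on unit-normalized vectors, inner product equals cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in answer embeddings: 5 candidates, 768-dim, unit-normalized
answer_emb = rng.normal(size=(5, 768))
answer_emb /= np.linalg.norm(answer_emb, axis=1, keepdims=True)

# A clue embedding constructed to lie near answer 3
clue_emb = answer_emb[3] + 0.01 * rng.normal(size=768)
clue_emb /= np.linalg.norm(clue_emb)

# Inner-product search (what a FAISS IndexFlatIP does under the hood)
scores = answer_emb @ clue_emb
best_answer = int(np.argmax(scores))
```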

Model Description

  • Architecture: Asymmetric Dual Encoder with two separate XLM-RoBERTa encoders (one for clues, one for answers), a shared LayerNorm, and a shared linear projection head.
  • Base encoder: sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (278M parameters per tower)
  • Training objective: Symmetric contrastive loss (InfoNCE) with in-batch hard negative mining and learnable temperature
  • Embedding dimension: 768
  • Training data: Italian crossword clues and dictionary definitions paired with their answers/defined words
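The symmetric contrastive objective above can be sketched as follows. This is an illustrative reimplementation, not the project's training code: matched clue/answer pairs sit on the diagonal of the in-batch similarity matrix, and a cross-entropy is applied in both directions.

```python
import numpy as np

def symmetric_info_nce(clue_emb, ans_emb, temperature=0.07):
    # In-batch similarity matrix; entry (i, j) scores clue i against answer j
    sims = (clue_emb @ ans_emb.T) / temperature

    def diag_cross_entropy(s):
        # cross-entropy where the correct "class" for row i is column i
        s = s - s.max(axis=1, keepdims=True)
        log_probs = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # symmetric: clue->answer and answer->clue directions
    return 0.5 * (diag_cross_entropy(sims) + diag_cross_entropy(sims.T))

# Perfectly matched pairs score a much lower loss than shuffled ones
matched = np.eye(3)
shuffled = np.eye(3)[[1, 2, 0]]
loss_matched = symmetric_info_nce(matched, matched)
loss_shuffled = symmetric_info_nce(matched, shuffled)
```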

For more details, see the paper: Crossword Space: Latent Manifold Learning for Italian Crosswords and Beyond (CLiC-it 2025). This model corresponds to our best configuration, which includes dictionary items during training.

Usage

Loading the model

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "cruciverb-it/crossword-space-mpnet-base-ade",
    trust_remote_code=True,
)

Or, if you have the crossword-space repository cloned locally:

from model import DualEncoderADE

model = DualEncoderADE.from_pretrained("cruciverb-it/crossword-space-mpnet-base-ade")

Full example

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "cruciverb-it/crossword-space-mpnet-base-ade",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("cruciverb-it/crossword-space-mpnet-base-ade")
model.eval()

clues = ["Capitale d'Italia", "Fiume che attraversa Roma"]
answers = ["ROMA", "TEVERE"]

clue_enc = tokenizer(clues, padding=True, truncation=True, max_length=64, return_tensors="pt")
ans_enc = tokenizer(answers, padding=True, truncation=True, max_length=16, return_tensors="pt")

with torch.no_grad():
    clue_emb, ans_emb = model(
        def_input_ids=clue_enc["input_ids"],
        def_attention_mask=clue_enc["attention_mask"],
        ans_input_ids=ans_enc["input_ids"],
        ans_attention_mask=ans_enc["attention_mask"],
    )

# L2-normalize for cosine similarity / inner-product search
clue_emb = F.normalize(clue_emb, dim=-1)
ans_emb = F.normalize(ans_emb, dim=-1)

# Similarity matrix
similarity = clue_emb @ ans_emb.T
print(similarity)

Evaluation Results

Retrieval performance on four Italian test sets. Length-filtered metrics restrict the FAISS index to candidates matching the expected answer length, which is the standard setting for crossword solving.
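The length-filtered setting reduces to a simple candidate filter applied before retrieval: only answers matching the known slot length stay in the index. The candidate list below is a toy example.

```python
def length_filter(candidates, answer_length):
    # keep only candidates whose length matches the crossword slot
    return [c for c in candidates if len(c) == answer_length]

candidates = ["ROMA", "TEVERE", "MILANO", "PO", "ARNO"]
filtered = length_filter(candidates, 4)
```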

Full Index

| Test Set | Acc@1 | Acc@10 | Acc@100 | Acc@1000 | MRR |
|---|---|---|---|---|---|
| Crossword | 31.8 | 63.4 | 81.3 | 91.0 | 42.7 |
| Dictionary | 17.5 | 40.5 | 63.3 | 82.2 | 25.3 |
| ONLI | 11.8 | 34.9 | 61.0 | 81.5 | 19.7 |
| Neologisms | 9.0 | 25.0 | 59.0 | 82.0 | 14.6 |

Length-filtered Index

| Test Set | Acc@1 | Acc@10 | Acc@100 | Acc@1000 | MRR |
|---|---|---|---|---|---|
| Crossword | 54.0 | 80.2 | 91.3 | 97.2 | 63.6 |
| Dictionary | 35.6 | 62.6 | 82.5 | 95.7 | 45.0 |
| ONLI | 36.0 | 65.0 | 84.3 | 95.8 | 45.8 |
| Neologisms | 28.0 | 60.0 | 84.0 | 94.0 | 38.1 |
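For reference, Acc@k and MRR reduce to simple functions of the gold answer's rank in the retrieved list. The ranks below are made up for illustration.

```python
def acc_at_k(ranks, k):
    # fraction of queries whose gold answer appears in the top k
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    # mean of 1/rank, counting missed queries (None) as 0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2]            # 1-based gold rank per clue; None = missed
acc1 = acc_at_k(ranks, 1)
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/3 + 1/2) / 4
```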

Citation

@inproceedings{ciaccio-etal-2025-crossword-space,
    title = "Crossword Space: Latent Manifold Learning for Italian Crosswords and Beyond",
    author = "Ciaccio, Cristiano and Sarti, Gabriele and Miaschi, Alessio and Dell'Orletta, Felice",
    booktitle = "Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)",
    year = "2025",
    url = "https://aclanthology.org/2025.clicit-1.26/"
}

License

CC BY 4.0

Author: cruciverb-it

Likes: 3

Downloads: 17

Tags: transformers, safetensors, dual_encoder_ade, feature-extraction, crossword, dual-encoder, contrastive-learning, information-retrieval, sentence-similarity, italian, custom_code, it, dataset:cruciverb-it/evalita2026, base_model:sentence-transformers/paraphrase-multilingual-mpnet-base-v2, base_model:finetune:sentence-transformers/paraphrase-multilingual-mpnet-base-v2, license:cc-by-4.0, model-index, region:us

arcee-ai/Trinity-Mini-W4A16


license: apache-2.0
language:
  - en
  - es
  - fr
  - de
  - it
  - pt
  - ru
  - ar
  - hi
  - ko
  - zh
library_name: transformers
base_model:
  - arcee-ai/Trinity-Mini
base_model_relation: quantized

<div align="center"> <picture> <img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png" alt="Arcee Trinity Mini" style="max-width: 100%; height: auto;" > </picture> </div>

Trinity Mini W4A16

This repository contains the W4A16 quantized weights of Trinity-Mini (INT4 weights, 16-bit activations).

Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.

This model is tuned for reasoning, but in testing, it uses a similar total token count to competitive instruction-tuned models.


Trinity Mini is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used on AFM-4.5B with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog here

Try it out now at chat.arcee.ai


Model Details

  • Model Architecture: AfmoeForCausalLM
  • Parameters: 26B, 3B active
  • Experts: 128 total, 8 active, 1 shared
  • Context length: 128k
  • Training Tokens: 10T
  • License: Apache 2.0
  • Recommended settings:
    • temperature: 0.15
    • top_k: 50
    • top_p: 0.75
    • min_p: 0.06
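For convenience, the recommended settings above can be collected into a kwargs dict for transformers' generate(). This is a usage sketch, not official guidance; min_p requires a recent transformers version.

```python
# Recommended Trinity Mini sampling settings packaged as generate() kwargs,
# e.g. model.generate(input_ids, max_new_tokens=256, **recommended_sampling)
recommended_sampling = {
    "do_sample": True,
    "temperature": 0.15,
    "top_k": 50,
    "top_p": 0.75,
    "min_p": 0.06,
}
```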

Quantization Details

  • Scheme: W4A16 (INT4 weights, 16-bit activations)
  • Intended use: quality-preserving 4-bit deployment of Trinity-Mini
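The W4A16 idea can be illustrated with a toy symmetric int4 round-trip: weights are rounded to the int4 range [-8, 7] with a float scale, while activations stay in 16-bit and compute happens after dequantization. This is an illustration, not the actual compressed-tensors scheme used for this checkpoint.

```python
import numpy as np

def quantize_int4(w):
    # symmetric per-tensor int4: map the largest magnitude to 7
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # dequantize back to 16-bit for the matmul with 16-bit activations
    return (q.astype(np.float32) * scale).astype(np.float16)

w = np.array([0.7, -0.35, 0.1, -0.02], dtype=np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)  # rounding error bounded by ~scale/2 per weight
```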

Benchmarks

<div align="center"> <picture> <img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology"> </picture> </div>

Running our model

Transformers

Use the main transformers branch

git clone https://github.com/huggingface/transformers.git
cd transformers

# pip
pip install '.[torch]'

# uv
uv pip install '.[torch]'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/Trinity-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.5,
    top_k=50,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

If using a released transformers, simply pass "trust_remote_code=True":

model_id = "arcee-ai/Trinity-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

vLLM

Supported in vLLM release 0.11.1

# pip
pip install "vllm>=0.11.1"

Serving the model with suggested settings:

vllm serve arcee-ai/Trinity-Mini \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --tool-call-parser hermes

llama.cpp

Supported in llama.cpp release b7061

Download the latest llama.cpp release

llama-server -hf arcee-ai/Trinity-Mini-GGUF:q4_k_m \
  --temp 0.15 \
  --top-k 50 \
  --top-p 0.75 \
  --min-p 0.06

LM Studio

Supported in latest LM Studio runtime

Update to latest available, then verify your runtime by:

  1. Click "Power User" at the bottom left
  2. Click the green "Developer" icon at the top left
  3. Select "LM Runtimes" at the top
  4. Refresh the list of runtimes and verify that the latest is installed

Then, go to Model Search, search for arcee-ai/Trinity-Mini-GGUF, download your preferred size, and load it up in the chat.

API

Trinity Mini is available today on OpenRouter:

https://openrouter.ai/arcee-ai/trinity-mini

curl -X POST "https://openrouter.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/trinity-mini",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
  }'

License

Trinity-Mini-W4A16 is released under the Apache-2.0 license.

Author: arcee-ai

Likes: 2

Downloads: 0

Tags: transformers, safetensors, afmoe, text-generation, conversational, custom_code, en, es, fr, de, it, pt, ru, ar, hi, ko, zh, base_model:arcee-ai/Trinity-Mini, base_model:quantized:arcee-ai/Trinity-Mini, license:apache-2.0, endpoints_compatible, compressed-tensors, region:us

jamesburton/Phi-4-reasoning-vision-15B-GGUF


license: mit
language:
  • en
base_model: microsoft/Phi-4-reasoning-vision-15B
tags:
  • phi4
  • phi-4
  • gguf
  • quantized
  • llama-cpp
  • ollama
  • text-generation
  • reasoning
model_type: phi3
quantized_by: jamesburton
pipeline_tag: text-generation

Phi-4-reasoning-vision-15B-GGUF

GGUF format conversions of microsoft/Phi-4-reasoning-vision-15B for use with llama.cpp and Ollama.

Note: This conversion includes the text backbone only (language model weights). Vision encoder and multimodal projector weights are excluded, as llama.cpp does not yet support the phi4-siglip vision architecture. The text model is architecturally identical to Phi-4-reasoning-plus (Phi3ForCausalLM).

Available Files

| Filename | Quant Type | Size | Description |
|---|---|---|---|
| phi-4-reasoning-vision-f16.gguf | F16 | ~28 GB | Full precision (float16) |
| phi-4-reasoning-vision-q8_0.gguf | Q8_0 | ~15 GB | 8-bit quantization (near-lossless) |
| phi-4-reasoning-vision-q6_k.gguf | Q6_K | ~12 GB | 6-bit K-quant |
| phi-4-reasoning-vision-q5_k_m.gguf | Q5_K_M | ~9.9 GB | 5-bit K-quant medium |
| phi-4-reasoning-vision-q5_k_s.gguf | Q5_K_S | ~9.5 GB | 5-bit K-quant small |
| phi-4-reasoning-vision-q4_K_M.gguf | Q4_K_M | ~8.5 GB | 4-bit K-quant medium (recommended) |
| phi-4-reasoning-vision-q4_k_s.gguf | Q4_K_S | ~7.9 GB | 4-bit K-quant small |
| phi-4-reasoning-vision-q3_k_l.gguf | Q3_K_L | ~7.4 GB | 3-bit K-quant large |
| phi-4-reasoning-vision-q3_k_m.gguf | Q3_K_M | ~6.9 GB | 3-bit K-quant medium |
| phi-4-reasoning-vision-q3_k_s.gguf | Q3_K_S | ~6.1 GB | 3-bit K-quant small |
| phi-4-reasoning-vision-q2_k.gguf | Q2_K | ~5.2 GB | 2-bit K-quant (smallest, lowest quality) |
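As a rough sanity check on these sizes, dividing file size by parameter count gives the effective bits per weight. A quick sketch, assuming the ~15B text-model parameter count from the model details (file sizes are approximate, so the results are only indicative):

```python
def bits_per_weight(file_size_gb, n_params_b=15):
    """Effective bits per weight: total bytes * 8 / parameter count."""
    return file_size_gb * 1e9 * 8 / (n_params_b * 1e9)

# Approximate sizes taken from the table above.
for name, gb in [("F16", 28), ("Q8_0", 15), ("Q4_K_M", 8.5), ("Q2_K", 5.2)]:
    print(f"{name}: ~{bits_per_weight(gb):.1f} bits/weight")
```

Note that K-quants come out above their nominal bit width (e.g. Q4_K_M lands around 4.5 bits/weight) because some sensitive tensors are kept at higher precision.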

How to Use

With Ollama

# Download the Q4_K_M GGUF and create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./phi-4-reasoning-vision-q4_K_M.gguf

TEMPLATE """<|system|>
{{ if .System }}{{ .System }}{{ else }}You are a helpful AI assistant with vision capabilities. You can analyze images and reason about them step by step.{{ end }}<|end|>
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""

PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
EOF

ollama create phi4-vision -f Modelfile
ollama run phi4-vision

With llama.cpp

./llama-cli -m phi-4-reasoning-vision-q4_K_M.gguf -p "Explain the theory of relativity in simple terms." -n 512

Model Details

  • Original Model: microsoft/Phi-4-reasoning-vision-15B
  • Architecture: Phi3ForCausalLM (text backbone of Phi-4-reasoning-vision)
  • Parameters: ~15B (text model)
  • Hidden Size: 5120
  • Layers: 40
  • Attention Heads: 40 (10 KV heads, GQA)
  • Vocab Size: 100,352
  • Tokenizer: GPT-2 (BPE)
  • Context Length: Up to 131,072 tokens (with RoPE scaling)
  • License: MIT
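With 10 KV heads (GQA) instead of the full 40 attention heads, the KV cache is a quarter the size it would be under standard multi-head attention. A quick sketch of the cache footprint from the numbers above, assuming fp16 values and the usual head sizing of hidden_size / n_heads:

```python
hidden_size, n_heads, n_kv_heads, n_layers = 5120, 40, 10, 40
head_dim = hidden_size // n_heads          # 128, assuming standard head sizing
bytes_per_val = 2                          # fp16/bf16

# K and V caches per token, summed across all layers
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # 200 KiB

ctx = 131_072
print(f"{kv_bytes_per_token * ctx / 2**30:.1f} GiB at full context")  # 25.0 GiB
```

At the full 131,072-token context this works out to roughly 25 GiB of KV cache on top of the weights, which is why long-context use typically requires a shorter `num_ctx` or KV-cache quantization.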

Conversion Details

  • Converted using llama.cpp convert_hf_to_gguf.py
  • Vision tower (model.vision_tower.*) and multimodal projector (model.mm_projector.*) weights were skipped during conversion
  • The model config was remapped from Phi4ForCausalLMV (phi4-siglip) to Phi3ForCausalLM (phi3) since the text backbone is architecturally identical
  • Quantization performed via llama_model_quantize() with CUDA acceleration
  • 243 text tensors converted, 452 vision tensors excluded
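The vision-weight exclusion described above amounts to partitioning the checkpoint by tensor-name prefix before conversion. A minimal sketch of that step, using the two prefixes named above (the individual tensor names here are illustrative):

```python
# Prefixes taken from the conversion notes above.
VISION_PREFIXES = ("model.vision_tower.", "model.mm_projector.")

def split_tensors(state_dict):
    """Partition a checkpoint into text tensors (kept) and vision tensors (skipped)."""
    kept = {k: v for k, v in state_dict.items() if not k.startswith(VISION_PREFIXES)}
    skipped = {k: v for k, v in state_dict.items() if k.startswith(VISION_PREFIXES)}
    return kept, skipped

# Illustrative tensor names only; real checkpoints have hundreds of entries.
ckpt = {
    "model.layers.0.self_attn.q_proj.weight": None,
    "model.vision_tower.blocks.0.attn.qkv.weight": None,
    "model.mm_projector.linear_1.weight": None,
}
kept, skipped = split_tensors(ckpt)
print(len(kept), len(skipped))  # 1 2
```

Because the surviving text tensors match Phi3ForCausalLM exactly, the remapped config lets llama.cpp load the result as an ordinary Phi-3-family model.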

Original Model Card

For full details on training, capabilities, safety, and intended use, please refer to the original model card.

Disclaimer

This is an unofficial GGUF conversion. The original model was created by Microsoft Research. All credit for the model architecture, training, and capabilities belongs to the Microsoft Phi team. Please refer to the original model's license for usage terms.

Author: jamesburton

Likes: 2

Downloads: 0

Tags: gguf, phi4, phi-4, quantized, llama-cpp, ollama, text-generation, reasoning, en, base_model:microsoft/Phi-4-reasoning-vision-15B, base_model:quantized:microsoft/Phi-4-reasoning-vision-15B, license:mit, endpoints_compatible, region:us

GitMylo/LTX2.3-Quant-formats

Quant formats for LTX2.3

  • NVFP4
  • FP8_e4m3fn scaled

Author: GitMylo

Likes: 2

Downloads: 0

Tags: region:us

litert-community/Qwen3.5-2B-LiteRT


license: apache-2.0
pipeline_tag: text-generation
tags:
  • litert
  • tflite
  • on-device
  • qwen
  • qwen3.5

qwen3.5-2b

This repository contains an experimental locally converted LiteRT/TFLite artifact.

  • target: qwen3.5-2b
  • artifact: qwen3-5-2b_q8_ekv128.tflite
  • size: 1835.28 MB
  • generated_at: 2026-03-06T07:41:24.537004+00:00

Author: litert-community

Likes: 2

Downloads: 0

Tags: litert, tflite, on-device, qwen, qwen3.5, text-generation, license:apache-2.0, region:us

litert-community/Qwen3.5-0.6B-LiteRT


license: apache-2.0
pipeline_tag: text-generation
tags:
  • litert
  • tflite
  • on-device
  • qwen
  • qwen3.5

qwen3.5-0.6b

This repository contains an experimental locally converted LiteRT/TFLite artifact.

  • target: qwen3.5-0.6b
  • artifact: qwen3-5-0-6b_q4_block32_ekv512.tflite
  • size: 323.95 MB
  • generated_at: 2026-03-06T07:41:06.748087+00:00

Author: litert-community

Likes: 2

Downloads: 0

Tags: litert, tflite, on-device, qwen, qwen3.5, text-generation, license:apache-2.0, region:us

yuvalkansal/QwQ-Med-3

QwQ-Med-3

QwQ-Med-3 is a medical reasoning model fine-tuned from Qwen/QwQ-32B on up to three-hop reasoning paths derived from a medical Knowledge Graph. It is introduced in the paper "Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need" by Bhishma Dedhia, Yuval Kansal, and Niraj K. Jha.

Paper & Code

Model Description

QwQ-Med-3 is trained using an SFT pipeline:

  1. Supervised Fine-Tuning (SFT): The base QwQ-32B model is fine-tuned on question-answer pairs grounded in multi-hop medical knowledge graph paths, teaching the model to produce structured chain-of-thought reasoning aligned with KG-derived evidence.
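One way to picture the KG-grounded SFT data described above: each multi-hop path is serialized into a question/answer pair whose reasoning trace walks the path hop by hop. This is a hypothetical sketch of such a data format, not the authors' actual pipeline; the entities, relations, and helper name are invented for illustration:

```python
def path_to_sft_example(path, question, answer):
    """Serialize a multi-hop KG path into a chat-format SFT example.

    `path` is a list of (head, relation, tail) triples; the assistant turn
    reasons along the path before stating the answer. Hypothetical format.
    """
    hops = " -> ".join(f"{h} --{r}--> {t}" for h, r, t in path)
    reasoning = f"Reasoning over the path: {hops}."
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"{reasoning}\nAnswer: {answer}"},
    ]

# Invented two-hop medical example for illustration only.
path = [("metformin", "treats", "type 2 diabetes"),
        ("type 2 diabetes", "risk_factor_for", "neuropathy")]
example = path_to_sft_example(
    path, "Which complication can the condition treated by metformin lead to?", "neuropathy")
print(example[1]["content"])
```

Fine-tuning on pairs shaped like this is what grounds the model's chain-of-thought in KG-derived evidence rather than free-form rationales.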

Intended Use

  • Medical question answering requiring multi-hop reasoning
  • Evaluation on ICD-coded clinical vignettes
  • Research into knowledge graph-guided language model training

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "yuvalkansal/QwQ-Med-3",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("yuvalkansal/QwQ-Med-3")

prompt = "A 45-year-old patient presents with..."
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Citation

If you use this model, please cite:

@misc{dedhia2025bottomupsuperintelligence,
  author = {Dedhia, Bhishma and Kansal, Yuval and Jha, Niraj K.},
  title = {Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need},
  year = {2025},
  url = {https://arxiv.org/abs/2507.13966}
}

Author: yuvalkansal

Likes: 1

Downloads: 0

Tags: safetensors, qwen2, arxiv:2507.13966, region:us

NousResearch/moe-10b-a1b-8k-wsd-lr3e4-1t

Author: NousResearch

Likes: 1

Downloads: 0

Tags: region:us