Today's AI Summary

AI Developments: EmbeddingGemma Leads in Text Representations, RAG Security Under Scrutiny

Here's a look at the latest AI models and research papers:

Research Highlights

  • EmbeddingGemma: Powerful and Lightweight Text Representations: This paper introduces EmbeddingGemma, a new open text embedding model based on the Gemma 3 family. It uses knowledge distillation and a spread-out regularizer to achieve state-of-the-art results on the Massive Text Embedding Benchmark (MTEB) with fewer than 500M parameters. Its performance is comparable to models twice its size, making it suitable for low-latency applications.
  • RAG Security and Privacy: Formalizing the Threat Model and Attack Surface: This paper addresses the security and privacy challenges in Retrieval-Augmented Generation (RAG) systems. It proposes a formal threat model and taxonomy of adversary types, defining key threat vectors like document-level membership inference and data poisoning. This work lays the foundation for a more rigorous understanding of security in RAG systems.
  • Video models are zero-shot learners and reasoners: This paper demonstrates that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for, such as segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more.
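A model like EmbeddingGemma is typically used for dense retrieval: encode documents and queries into vectors, then rank by cosine similarity. A minimal sketch of that ranking step, assuming pre-computed (and here purely illustrative, tiny) embeddings:

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Rank documents by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # dot product of unit vectors = cosine similarity
    return np.argsort(-scores), scores

# Hypothetical 4-dim embeddings; real embedding models use hundreds of dims.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
order, scores = rank_by_cosine(query, docs)
print(order)  # indices of documents, most similar first
```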

Model Updates

  • mradermacher/gpt-oss-20b-ru-reasoner-GGUF: This model provides quantized versions of NotEvilAI/gpt-oss-20b-ru-reasoner, focusing on reasoning in Russian and English. It is based on the GPT-OSS architecture and fine-tuned on datasets like NotEvilAI/ru-reasoning_effort-sft_dpo_think_gpt.
  • khazarai's Azerbaijani Models: khazarai has released several fine-tuned models based on unsloth/Qwen3-1.7B for the Azerbaijani language:
    • AzQwen: Fine-tuned on a translated version of the Alpaca Stanford dataset for instruction following.
    • Azerbaijani-math-1.7B: Adapted for mathematical problem-solving in Azerbaijani, trained on datasets like OnlyCheeini/azerbaijani-math-gpt4o.
    • Nizami-1.7B: Fine-tuned for academic-style comprehension and reasoning in Azerbaijani, trained on the az-llm/az_academic_qa-v1.0 dataset.
  • AGofficial/AgGPT21: A GPT-style language model built with PyTorch, featuring word-level tokenization and a GRU-based architecture.

Key Takeaways

  • Embedding Models Excel: EmbeddingGemma showcases the potential for lightweight models to achieve state-of-the-art performance in text representation through strategic training techniques.
  • RAG Security is Critical: The formalization of threats in RAG systems highlights the importance of addressing security and privacy concerns in these emerging architectures.
  • Azerbaijani Language Models Emerge: The release of multiple Azerbaijani language models demonstrates a growing focus on developing resources for low-resource languages.
  • Video Models as Generalist Vision Foundation Models: Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

AI Papers for 2026-03-20

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
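The core pruning operation described above, keeping only the highest-scoring half of the vision tokens, can be sketched as follows (the scores here are hypothetical; STTS actually learns them via an auxiliary temporal loss and LLM downstream gradients):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of vision tokens, preserving original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(-scores)[:n_keep])  # top-k indices, re-sorted
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))   # 8 vision tokens with 4-dim features
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
kept, idx = prune_tokens(tokens, scores)
print(idx)  # indices of the retained tokens
```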

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.
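The idea of persisting solutions as executable code rather than textual experience can be sketched with a minimal registry (all names below are hypothetical; AgentFactory's real subagents are full Python modules with standardized documentation and feedback-driven refinement):

```python
import textwrap
import types

class SubagentLibrary:
    """Toy store of reusable subagents kept as executable Python source."""

    def __init__(self):
        self.sources = {}  # name -> Python source string

    def save(self, name, source):
        self.sources[name] = textwrap.dedent(source)

    def load(self, name):
        # Compile the stored source into a fresh module namespace.
        module = types.ModuleType(name)
        exec(self.sources[name], module.__dict__)
        return module

lib = SubagentLibrary()
lib.save("csv_summary", """
    def run(rows):
        '''Return count and per-column sums for a list of numeric rows.'''
        sums = [sum(col) for col in zip(*rows)]
        return {"count": len(rows), "sums": sums}
""")
agent = lib.load("csv_summary")
print(agent.run([[1, 2], [3, 4]]))  # {'count': 2, 'sums': [4, 6]}
```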

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.

Specification-Aware Distribution Shaping for Robotics Foundation Models

Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.
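The kind of hard feasibility check described above can be illustrated for the simplest STL fragment, "always stay at least d_min from an obstacle over the remaining horizon," by propagating a candidate action through a forward dynamics model (here a toy constant-velocity model, not the paper's formulation):

```python
import math

def stl_always_safe(state, action, obstacle, d_min, horizon, dt=0.1):
    """Check G_[0,T](dist > d_min) by rolling out toy constant-velocity dynamics."""
    x, y = state
    vx, vy = action
    for _ in range(horizon):
        x, y = x + vx * dt, y + vy * dt  # forward dynamics propagation
        if math.hypot(x - obstacle[0], y - obstacle[1]) <= d_min:
            return False  # specification violated somewhere on the horizon
    return True

# Heading straight at an obstacle 1 unit away violates the spec...
print(stl_always_safe((0, 0), (1, 0), obstacle=(1, 0), d_min=0.2, horizon=20))
# ...while steering away satisfies it.
print(stl_always_safe((0, 0), (0, 1), obstacle=(1, 0), d_min=0.2, horizon=20))
```

A specification-aware framework would reject (or reshape the distribution away from) actions for which this check fails.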

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is simultaneously lossless, navigable, scalable, and caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Treating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
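The logarithmic-access claim follows from the grid geometry: if each atlas page tiles k×k thumbnails, isolating any single frame among n requires only about ⌈log_{k²}(n)⌉ zoom steps. A quick sketch of that relationship (the grid size is a hypothetical parameter, not the paper's exact configuration):

```python
import math

def zoom_depth(n_frames, grid=3):
    """Zoom steps needed to isolate one frame when each page holds grid*grid tiles."""
    per_page = grid * grid
    return max(1, math.ceil(math.log(n_frames, per_page)))

# Depth grows very slowly with video length.
for n in (9, 1_000, 100_000, 1_000_000):
    print(n, zoom_depth(n))
```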

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
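The difference between weight-only SVD and activation-preserving factorization can be sketched in a few lines: instead of factoring W directly, factor W·S where S is a square root of the input covariance, so the low-rank approximation is accurate where activations actually live. This is a standard whitened-SVD construction under synthetic data, not CARE's exact pipeline:

```python
import numpy as np

def activation_aware_lowrank(W, X, rank):
    """Low-rank factors of W chosen to preserve W @ x over the activation distribution X."""
    C = X.T @ X / len(X)                           # empirical input covariance
    S = np.linalg.cholesky(C + 1e-6 * np.eye(C.shape[0]))
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * s[:rank]                     # scaled left factors
    B = Vt[:rank] @ np.linalg.inv(S)               # undo the whitening
    return A, B                                    # W ≈ A @ B on the data

rng = np.random.default_rng(0)
# Anisotropic inputs: only 4 of 16 directions carry significant energy.
X = rng.normal(size=(256, 16)) @ np.diag([3.0] * 4 + [0.1] * 12)
W = rng.normal(size=(8, 16))
A, B = activation_aware_lowrank(W, X, rank=4)
err = np.linalg.norm(X @ (W - A @ B).T) / np.linalg.norm(X @ W.T)
print(err)  # small relative activation error despite rank 4 << 16
```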

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE-rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
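The consistency metrics referenced above are straightforward to compute once each (prompt, language) pair has a verdict. A minimal sketch of cross-language agreement and per-prompt verdict entropy over hypothetical labels:

```python
import math
from collections import Counter

def agreement_and_entropy(verdicts_by_lang):
    """verdicts_by_lang: {lang: [verdict per prompt]}; lists aligned by prompt index."""
    per_prompt = list(zip(*verdicts_by_lang.values()))
    agree = sum(len(set(v)) == 1 for v in per_prompt) / len(per_prompt)
    entropies = []
    for v in per_prompt:
        counts, n = Counter(v), len(v)
        entropies.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return agree, sum(entropies) / len(entropies)

# Hypothetical verdicts for 4 prompts across 3 languages.
verdicts = {
    "hi": ["SAFE", "UNSAFE", "SAFE", "SAFE"],
    "bn": ["SAFE", "UNSAFE", "UNSAFE", "SAFE"],
    "ta": ["SAFE", "SAFE", "UNSAFE", "SAFE"],
}
agree, mean_entropy = agreement_and_entropy(verdicts)
print(agree)  # fraction of prompts where all three languages agree
```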

Differential Privacy in Generative AI Agents: Analysis and Optimal Tradeoffs

Large language models (LLMs) and AI agents are increasingly integrated into enterprise systems to access internal databases and generate context-aware responses. While such integration improves productivity and decision support, the model outputs may inadvertently reveal sensitive information. Although many prior efforts focus on protecting the privacy of user prompts, relatively few studies consider privacy risks from the enterprise data perspective. Hence, this paper develops a probabilistic framework for analyzing privacy leakage in AI agents based on differential privacy. We model response generation as a stochastic mechanism that maps prompts and datasets to distributions over token sequences. Within this framework, we introduce token-level and message-level differential privacy and derive privacy bounds that relate privacy leakage to generation parameters such as temperature and message length. We further formulate a privacy-utility design problem that characterizes optimal temperature selection.
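The temperature-privacy connection can be made concrete for softmax sampling: if a single record changes each token's logits by at most Δ, the exponential-mechanism argument bounds per-token leakage by ε_token ≤ 2Δ/T, and basic sequential composition over an L-token message gives ε_message ≤ L·ε_token. A sketch under those standard (not paper-specific) assumptions:

```python
def token_epsilon(delta_logit, temperature):
    """Per-token DP bound for softmax sampling with logit sensitivity delta_logit."""
    return 2.0 * delta_logit / temperature

def message_epsilon(delta_logit, temperature, length):
    """Basic sequential composition over a message of `length` tokens."""
    return length * token_epsilon(delta_logit, temperature)

# Higher temperature -> more randomness -> tighter privacy bound;
# longer messages -> more leakage under composition.
print(message_epsilon(delta_logit=0.5, temperature=1.0, length=100))  # 100.0
print(message_epsilon(delta_logit=0.5, temperature=2.0, length=100))  # 50.0
```

This makes the privacy-utility trade-off explicit: raising the temperature or capping the message length strengthens the guarantee at the cost of response quality.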

AI Models

rednote-hilab/dots.mocr


---
license: mit
library_name: dots_mocr
pipeline_tag: image-text-to-text
tags:
  - image-to-text
  - ocr
  - document-parse
  - layout
  - table
  - formula
  - transformers
  - custom_code
language:
  - en
  - zh
  - multilingual
---

<div align="center"> <h1 align="center"> dots.mocr </h1>


<div align="center"> <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> | <a href="https://x.com/rednotehilab" target="_blank" rel="noopener noreferrer"><strong>🐦 X</strong></a> </div> </div>

Introduction

We present dots.mocr. Beyond achieving state-of-the-art (SOTA) performance in standard multilingual document parsing among models of comparable size, dots.mocr excels at converting structured graphics (e.g., charts, UI layouts, and scientific figures) directly into SVG code. Its core capabilities encompass grounding, recognition, semantic understanding, and interactive dialogue.

Simultaneously, we are releasing dots.mocr-svg, a variant specifically optimized for robust image-to-SVG parsing tasks.

More information can be found in the paper.

Evaluation

1. Document Parsing

1.1 Elo scores of recent models across benchmarks

<table> <thead> <tr> <th>Models</th> <th>olmOCR-Bench</th> <th>OmniDocBench (v1.5)</th> <th>XDocParse</th> <th>Average</th> </tr> </thead> <tbody> <tr> <td>MonkeyOCR-pro-3B</td> <td>895.0</td> <td>811.3</td> <td>637.1</td> <td>781.1</td> </tr> <tr> <td>GLM-OCR</td> <td>884.2</td> <td>972.6</td> <td>820.7</td> <td>892.5</td> </tr> <tr> <td>PaddleOCR-VL-1.5</td> <td>897.3</td> <td>997.9</td> <td>866.4</td> <td>920.5</td> </tr> <tr> <td>HunyuanOCR</td> <td>997.6</td> <td>1003.9</td> <td>951.1</td> <td>984.2</td> </tr> <tr> <td>dots.ocr</td> <td>1041.1</td> <td>1027.2</td> <td>1190.3</td> <td>1086.2</td> </tr> <tr> <td><strong>dots.mocr</strong></td> <td><strong>1104.4</strong></td> <td><strong>1059.0</strong></td> <td><strong>1210.7</strong></td> <td><strong>1124.7</strong></td> </tr> <tr> <td>Gemini 3 Pro</td> <td>1180.4</td> <td>1128.0</td> <td>1323.7</td> <td>1210.7</td> </tr> </tbody> </table>

Notes:

  • Results for Gemini 3 Pro, PaddleOCR-VL-1.5, and GLM-OCR were obtained via APIs, while HunyuanOCR results were generated using local inference.
  • The Elo score evaluation was conducted using Gemini 3 Flash. The prompt can be found at: Elo Score Prompt. These results are consistent with the findings on ocrarena.

1.2 olmOCR-bench

<table> <thead> <tr> <th>Model</th> <th>ArXiv</th> <th>Old scans math</th> <th>Tables</th> <th>Old scans</th> <th>Headers & footers</th> <th>Multi column</th> <th>Long tiny text</th> <th>Base</th> <th>Overall</th> </tr> </thead> <tbody> <tr> <td>Mistral OCR API</td> <td>77.2</td> <td>67.5</td> <td>60.6</td> <td>29.3</td> <td>93.6</td> <td>71.3</td> <td>77.1</td> <td>99.4</td> <td>72.0±1.1</td> </tr> <tr> <td>Marker 1.10.1</td> <td>83.8</td> <td>66.8</td> <td>72.9</td> <td>33.5</td> <td>86.6</td> <td>80.0</td> <td>85.7</td> <td>99.3</td> <td>76.1±1.1</td> </tr> <tr> <td>MinerU 2.5.4*</td> <td>76.6</td> <td>54.6</td> <td>84.9</td> <td>33.7</td> <td>96.6</td> <td>78.2</td> <td>83.5</td> <td>93.7</td> <td>75.2±1.1</td> </tr> <tr> <td>DeepSeek-OCR</td> <td>77.2</td> <td>73.6</td> <td>80.2</td> <td>33.3</td> <td>96.1</td> <td>66.4</td> <td>79.4</td> <td>99.8</td> <td>75.7±1.0</td> </tr> <tr> <td>Nanonets-OCR2-3B</td> <td>75.4</td> <td>46.1</td> <td>86.8</td> <td>40.9</td> <td>32.1</td> <td>81.9</td> <td>93.0</td> <td>99.6</td> <td>69.5±1.1</td> </tr> <tr> <td>PaddleOCR-VL*</td> <td>85.7</td> <td>71.0</td> <td>84.1</td> <td>37.8</td> <td>97.0</td> <td>79.9</td> <td>85.7</td> <td>98.5</td> <td>80.0±1.0</td> </tr> <tr> <td>Infinity-Parser 7B*</td> <td>84.4</td> <td>83.8</td> <td>85.0</td> <td>47.9</td> <td>88.7</td> <td>84.2</td> <td>86.4</td> <td>99.8</td> <td>82.5±?</td> </tr> <tr> <td>olmOCR v0.4.0</td> <td>83.0</td> <td>82.3</td> <td>84.9</td> <td>47.7</td> <td>96.1</td> <td>83.7</td> <td>81.9</td> <td>99.7</td> <td>82.4±1.1</td> </tr> <tr> <td>Chandra OCR 0.1.0*</td> <td>82.2</td> <td>80.3</td> <td>88.0</td> <td>50.4</td> <td>90.8</td> <td>81.2</td> <td>92.3</td> <td>99.9</td> <td>83.1±0.9</td> </tr> <tr> <td>dots.ocr</td> <td>82.1</td> <td>64.2</td> <td>88.3</td> <td>40.9</td> <td>94.1</td> <td>82.4</td> <td>81.2</td> <td>99.5</td> <td>79.1±1.0</td> </tr> <tr> <td><strong>dots.mocr</strong></td> <td><strong>85.9</strong></td> <td><strong>85.5</strong></td> 
<td><strong>90.7</strong></td> <td>48.2</td> <td>94.0</td> <td><strong>85.3</strong></td> <td>81.6</td> <td>99.7</td> <td><strong>83.9±0.9</strong></td> </tr> </tbody> </table>

Note:

  • The metrics are from olmOCR and from our own internal evaluations.
  • We remove the Page-header and Page-footer cells from the resulting Markdown.

1.3 Other Benchmarks

<table> <thead> <tr> <th>Model Type</th> <th>Methods</th> <th>Size</th> <th>OmniDocBench(v1.5)<br>TextEdit↓</th> <th>OmniDocBench(v1.5)<br>Read OrderEdit↓</th> <th>pdf-parse-bench</th> </tr> </thead> <tbody> <!-- GeneralVLMs Group (Reversed Order, 3 rows) --> <tr> <td rowspan="3"><strong>GeneralVLMs</strong></td> <td>Gemini-2.5 Pro</td> <td>-</td> <td>0.075</td> <td>0.097</td> <td>9.06</td> </tr> <tr> <td>Qwen3-VL-235B-A22B-Instruct</td> <td>235B</td> <td>0.069</td> <td>0.068</td> <td><strong>9.71</strong></td> </tr> <tr> <td>gemini3pro</td> <td>-</td> <td>0.066</td> <td>0.079</td> <td>9.68</td> </tr> <!-- SpecializedVLMs Group (Reversed Order, 12 rows) --> <tr> <td rowspan="12"><strong>SpecializedVLMs</strong></td> <td>Mistral OCR</td> <td>-</td> <td>0.164</td> <td>0.144</td> <td>8.84</td> </tr> <tr> <td>Deepseek-OCR</td> <td>3B</td> <td>0.073</td> <td>0.086</td> <td>8.26</td> </tr> <tr> <td>MonkeyOCR-3B</td> <td>3B</td> <td>0.075</td> <td>0.129</td> <td>9.27</td> </tr> <tr> <td>OCRVerse</td> <td>4B</td> <td>0.058</td> <td>0.071</td> <td>--</td> </tr> <tr> <td>MonkeyOCR-pro-3B</td> <td>3B</td> <td>0.075</td> <td>0.128</td> <td>-</td> </tr> <tr> <td>MinerU2.5</td> <td>1.2B</td> <td>0.047</td> <td>0.044</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL</td> <td>0.9B</td> <td>0.035</td> <td>0.043</td> <td>9.51</td> </tr> <tr> <td>HunyuanOCR</td> <td>0.9B</td> <td>0.042</td> <td>-</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL1.5</td> <td>0.9B</td> <td>0.035</td> <td>0.042</td> <td>-</td> </tr> <tr> <td>GLMOCR</td> <td>0.9B</td> <td>0.04</td> <td>0.043</td> <td>-</td> </tr> <tr> <td>dots.ocr</td> <td>3B</td> <td>0.048</td> <td>0.053</td> <td>9.29</td> </tr> <tr> <td><u><strong>dots.mocr</strong></u></td> <td>3B</td> <td><strong>0.031</strong></td> <td><strong>0.029</strong></td> <td>9.54</td> </tr> </tbody> </table>

Note:

  • Metrics are sourced from OmniDocBench and other model publications. pdf-parse-bench results are reproduced by Qwen3-VL-235B-A22B-Instruct.
  • Formula and Table metrics for OmniDocBench1.5 are omitted due to their high sensitivity to detection and matching protocols.

2. Structured Graphics Parsing

Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. dots.mocr unifies the interpretation of these elements by parsing them directly into SVG code.

<table> <thead> <tr> <th rowspan="2" style="text-align: left;">Methods</th> <th colspan="3">Unisvg</th> <th rowspan="2">Chartmimic</th> <th rowspan="2">Design2Code</th> <th rowspan="2">Genexam</th> <th rowspan="2">SciGen</th> <th rowspan="2">ChemDraw</th> </tr> <tr> <th>Low-Level</th> <th>High-Level</th> <th>Score</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">OCRVerse</td> <td>0.632</td> <td>0.852</td> <td>0.763</td> <td>0.799</td> <td>-</td> <td>-</td> <td>-</td> <td>0.881</td> </tr> <tr> <td style="text-align: left;">Gemini 3 Pro</td> <td>0.563</td> <td>0.850</td> <td>0.735</td> <td>0.788</td> <td>0.760</td> <td>0.756</td> <td>0.783</td> <td>0.839</td> </tr> <tr> <td style="text-align: left;">dots.mocr</td> <td>0.850</td> <td>0.923</td> <td>0.894</td> <td>0.772</td> <td>0.801</td> <td>0.664</td> <td>0.660</td> <td>0.790</td> </tr> <tr> <td style="text-align: left;"><strong>dots.mocr-svg</strong></td> <td><strong>0.860</strong></td> <td><strong>0.931</strong></td> <td><strong>0.902</strong></td> <td><strong>0.905</strong></td> <td><strong>0.834</strong></td> <td><strong>0.8</strong></td> <td><strong>0.797</strong></td> <td><strong>0.901</strong></td> </tr> </tbody> </table>

Note:

  • We use the ISVGEN metric from UniSVG to evaluate the parsing result. For benchmarks that do not natively support image parsing, we use the original images as input, and calculate the ISVGEN score between the rendered output and the original image.
  • OCRVerse results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.mocr are based specifically on SVG code.
  • Due to the capacity constraints of a 3B-parameter VLM, dots.mocr may not yet excel at every task, SVG generation in particular. To complement it, we are simultaneously releasing dots.mocr-svg. We plan to further address these limitations in future updates.

3. General Vision Tasks

<table> <thead> <tr> <th>Model</th> <th>CharXiv_descriptive</th> <th>CharXiv_reasoning</th> <th>OCR_Reasoning</th> <th>infovqa</th> <th>docvqa</th> <th>ChartQA</th> <th>OCRBench</th> <th>AI2D</th> <th>CountBenchQA</th> <th>refcoco</th> </tr> </thead> <tbody> <tr> <td>Qwen3vl-2b-instruct</td> <td>62.3</td> <td>26.8</td> <td>-</td> <td>72.4</td> <td>93.3</td> <td>-</td> <td>85.8</td> <td>76.9</td> <td>88.4</td> <td>-</td> </tr> <tr> <td>Qwen3vl-4b-instruct</td> <td>76.2</td> <td>39.7</td> <td>-</td> <td>80.3</td> <td>95.3</td> <td>-</td> <td>88.1</td> <td>84.1</td> <td>84.9</td> <td>-</td> </tr> <tr> <td><strong>dots.mocr</strong></td> <td>77.4</td> <td>55.3</td> <td>22.85</td> <td>73.76</td> <td>91.85</td> <td>83.2</td> <td>86.0</td> <td>82.16</td> <td>94.46</td> <td>80.03</td> </tr> </tbody> </table>

Quick Start

1. Installation

Install dots.mocr

conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
# pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# install flash-attn==2.8.0.post2 for faster inference
pip install -e .

If you have trouble with the installation, try our Docker image for an easier setup, then follow the steps below:

Download Model Weights

💡Note: Please use a directory name without periods (e.g., DotsMOCR instead of dots.mocr) for the model save path. This is a temporary workaround pending our integration with Transformers.

python3 tools/download_model.py

# with modelscope
python3 tools/download_model.py --type modelscope

2. Deployment

vLLM inference

We highly recommend vLLM for deployment and inference. Since vLLM v0.11.0, dots.ocr has been officially integrated into vLLM with verified performance, so you can deploy the model server directly from the vLLM Docker image (e.g., vllm/vllm-openai:v0.11.0).

# Launch vLLM model server
## dots.mocr
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.mocr --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

## dots.mocr-svg
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.mocr-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

# vLLM API Demo
# See dots_mocr/model/inference.py and dots_mocr/utils/prompts.py for details on parameter and prompt settings 
# that help achieve the best output quality.
## document parsing
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en 
## web parsing 
python3 ./demo/demo_vllm.py --prompt_mode prompt_web_parsing --image_path ./assets/showcase/origin/webpage_1.png
## scene spotting
python3 ./demo/demo_vllm.py --prompt_mode prompt_scene_spotting --image_path ./assets/showcase/origin/scene_1.jpg
## image parsing with svg code
python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg 
## general qa
python3 ./demo/demo_vllm_general.py
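Because vLLM exposes an OpenAI-compatible endpoint, a client can also call the server directly. A minimal sketch of building the chat-completions payload for an image request (the `model` name follows the `--served-model-name model` flag above; the prompt text and image bytes are illustrative):

```python
import base64
import json

def build_ocr_request(image_bytes, prompt, model="model"):
    """Construct an OpenAI-style chat-completions payload with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = build_ocr_request(b"\xff\xd8fake-jpeg", "Parse the document layout.")
# POST this as JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload)[:80])
```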

Hugging Face inference

python3 demo/demo_hf.py
<details> <summary><b>Hugging Face inference details</b></summary>
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
from dots_mocr.utils import dict_promptmode_to_prompt

model_path = "./weights/DotsMOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path
                },
                {"type": "text", "text": prompt}
            ]
        }
    ]

# Preparation for inference
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

</details>

Hugging Face inference with CPU

Please refer to CPU inference

3. Document Parse

With the vLLM server running, you can parse an image or a PDF file using the following commands:


# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_mocr/parser.py demo/demo_image1.jpg
# Parse a single PDF
python3 dots_mocr/parser.py demo/demo_pdf1.pdf  --num_thread 64  # increase --num_thread for PDFs with many pages

# Layout detection only
python3 dots_mocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_mocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr


With Transformers, you can parse an image or a PDF file using the same commands as above; just add --use_hf true.

Notice: Transformers inference is slower than vLLM. To run the demo/* scripts with Transformers, pass use_hf=True to DotsMOCRParser, i.e., DotsMOCRParser(..., use_hf=True).

<details> <summary><b>Output Results</b></summary>
  1. Structured Layout Data (demo_image1.json): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
  2. Processed Markdown File (demo_image1.md): A Markdown file generated from the concatenated text of all detected cells.
    • An additional version, demo_image1_nohf.md, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench.
  3. Layout Visualization (demo_image1.jpg): The original image with the detected layout bounding boxes drawn on it.
</details>
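The structured layout JSON described above can be post-processed into the Markdown variants with a few lines of code. A minimal sketch, assuming each element carries the bbox, category, and text fields described in the prompt (the sample elements below are hypothetical, not actual parser output):

```python
def layout_to_markdown(elements, skip=("Page-header", "Page-footer")):
    """Concatenate element text in list order (the parser emits elements
    already sorted in reading order), skipping headers/footers and any
    element without a text field (e.g., Picture)."""
    parts = [el["text"] for el in elements
             if el["category"] not in skip and "text" in el]
    return "\n\n".join(parts)

# Hypothetical elements mirroring the schema described above
elements = [
    {"bbox": [0, 0, 500, 40], "category": "Page-header", "text": "Journal of X"},
    {"bbox": [20, 60, 480, 100], "category": "Title", "text": "# A Title"},
    {"bbox": [20, 120, 480, 400], "category": "Text", "text": "Body paragraph."},
    {"bbox": [20, 420, 480, 600], "category": "Picture"},
]
print(layout_to_markdown(elements))  # keeps only the Title and Text cells
```

Skipping the Page-header/Page-footer categories corresponds to the demo_image1_nohf.md variant; dropping the skip argument reproduces the full demo_image1.md behavior.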

4. Demo

Have fun with the live demo.

Examples for document parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/formula1.png" alt="formula1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/table3.png" alt="table3.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/Tibetan.png" alt="Tibetan.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/tradition_zh.png" alt="tradition_zh.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/nl.png" alt="nl.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/kannada.png" alt="kannada.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/russian.png" alt="russian.png" border="0" />

Examples for image parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_1.png" alt="svg_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_2.png" alt="svg_2.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_4.png" alt="svg_4.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_5.png" alt="svg_5.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_6.png" alt="svg_6.png" border="0" />

Note:

  • Results generated by dots.mocr-svg

Example for web parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/webpage_1.png" alt="webpage_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/webpage_2.png" alt="webpage_2.png" border="0" />

Examples for scene spotting

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/scene_1.png" alt="scene_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/scene_2.png" alt="scene_2.png" border="0" />

Limitations & Future Work

  • Complex Document Elements:

    • Table & Formula: Extracting complex tables and mathematical formulas remains a difficult task given the model's compact architecture.
    • Picture: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
  • Parsing Failures: While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.

Citation

@misc{zheng2026multimodalocrparsedocuments,
      title={Multimodal OCR: Parse Anything from Documents}, 
      author={Handong Zheng and Yumeng Li and Kaile Zhang and Liang Xin and Guangwei Zhao and Hao Liu and Jiayu Chen and Jie Lou and Jiyu Qiu and Qi Fu and Rui Yang and Shuo Jiang and Weijian Luo and Weijie Su and Weijun Zhang and Xingyu Zhu and Yabin Li and Yiwei Ma and Yu Chen and Zhaohui Yu and Guang Yang and Colin Zhang and Lei Zhang and Yuliang Liu and Xiang Bai},
      year={2026},
      eprint={2603.13032},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13032}, 
}

Author: rednote-hilab

Likes: 11

Downloads: 0

Tags: dots_mocr, safetensors, dots_ocr, text-generation, image-to-text, ocr, document-parse, layout, table, formula, transformers, custom_code, image-text-to-text, conversational, en, zh, multilingual, arxiv:2603.13032, license:mit, region:us

Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF


language:

  • en
  • zh
  • ko

license: apache-2.0

base_model: Qwen/Qwen3.5-27B

tags:

  • unsloth
  • qwen
  • qwen3.5
  • reasoning
  • chain-of-thought
  • lora

pipeline_tag: image-text-to-text

datasets:

  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Jackrong/Qwen3.5-reasoning-700x
  • Roman1111111/claude-opus-4.6-10000x

🌟 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

📢 Announcement

v2 Update: This iteration is powered by 14,000+ premium Claude 4.6 Opus-style general reasoning samples, with a major focus on achieving massive gains in reasoning efficiency while actively improving peak accuracy.

v2 introduces a refined reasoning scaffold designed to eliminate redundant internal loops, significantly improving the model's cross-task generalization from logic and math into specialized fields like programming. Compared to the original model, autonomy and stability are significantly improved, ensuring the model remains robust and self-consistent during complex, multi-step problem solving. v2 is built to think smarter, not longer, delivering substantial improvements in inference speed and cost-effectiveness while simultaneously boosting baseline accuracy.

Note: Due to the constraints of SFT sample size and training scope, the model's broad general-purpose capabilities might be slightly impacted. The efficiency and accuracy results discussed here are based on the HumanEval and HumanEval+ benchmarks. Thank you for your understanding!


💡 Model Introduction

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 is the second iteration of this reasoning-focused Qwen3.5-27B fine-tune, built to drastically improve the efficiency of chain-of-thought generation, unlocking highly substantial gains in reasoning speed and cost-reduction while actually increasing absolute accuracy.

Compared with the earlier version, v2 was trained with 14,000 Claude 4.6 Opus-style general reasoning samples, with a stronger emphasis on transferring concise, reusable reasoning patterns rather than only maximizing raw benchmark scores. The goal of v2 is not simply to make the model "think more," but to help it think more economically: reducing unnecessarily long internal chains, avoiding verbose over-analysis on easy problems, and massively improving the reasoning-cost-to-quality ratio while beating the baseline's benchmark correctness.

A key design choice in v2 is that the distillation data is primarily general-domain reasoning data—specifically focused on mathematics, word problems, logical deduction, and a balanced mix of general knowledge and instructions—rather than specialized code-heavy supervision. Consequently, HumanEval and HumanEval+ are employed here to evaluate cross-task generalization and capability transfer, rather than serving as direct optimization targets. High performance on these benchmarks, despite the lack of code-centric training, confirms that the model's reasoning scaffold has become more robust and transferable, proving that fundamental reasoning logic can effectively power specialized tasks like programming.
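For reference, HumanEval results are conventionally reported with the unbiased pass@k estimator from the original Codex evaluation; a minimal sketch (n is the number of samples per problem, c the number that pass — the sample numbers below are illustrative, not this model's scores):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 1))  # 1 - C(3,1)/C(5,1) = 0.4
```

With k=1 this reduces to the fraction of passing samples, which is what single-sample HumanEval/HumanEval+ runs report.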

Why v2 matters

  • ⚠️ The model's scores are still under evaluation...

🗺️ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Qwen3.5-27B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
 │
 ▼
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
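The "Response-Only Training" step above can be sketched as follows: every label before and including the assistant marker is set to the ignore index, so the loss is computed on the response only. This is a minimal illustration using string tokens; the exact marker handling in Unsloth's response-only training may differ:

```python
def mask_prompt_tokens(tokens, marker_tokens, ignore_index=-100):
    """Return labels where everything up to and including the assistant
    marker subsequence is replaced by ignore_index, so cross-entropy
    loss is computed only on the response tokens."""
    labels = list(tokens)
    for i in range(len(tokens) - len(marker_tokens) + 1):
        if tokens[i:i + len(marker_tokens)] == marker_tokens:
            for j in range(i + len(marker_tokens)):
                labels[j] = ignore_index
            break
    return labels

# Illustrative string "tokens"; real training operates on token IDs
tokens = ["<|im_start|>", "user", "Q", "<|im_start|>assistant", "<think>",
          "step1", "answer"]
marker = ["<|im_start|>assistant", "<think>"]
print(mask_prompt_tokens(tokens, marker))
```

Tokens masked to -100 are ignored by the loss, which is how the SFT stage learns only from the Claude-style reasoning and answer text.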

🧠 Example of Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern:
“Let me analyze this request carefully: 1..2..3...”.
This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
            .
            .
            .

📚 All Datasets Used

The dataset consists of high-quality, filtered reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Provides comprehensive Claude 4.6 Opus reasoning trajectories. |
| Roman1111111/claude-opus-4.6-10000x | Large-scale public Claude 4.6 Opus distillation data used to strengthen general reasoning transfer in v2. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injects high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples designed to strengthen structured step-by-step problem solving and improve reasoning diversity. |

⚠️ Limitations & Intended Use

  • Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; facts stated during the thinking sequence may occasionally be hallucinated, especially when they concern real-world events that need verification.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • This model is a test version intended solely for learning, demonstration, academic research, and technical exploration.

🙏 Acknowledgements

Many thanks to the Unsloth AI team for making rapid fine-tuning of large language models accessible. We also thank the Qwen team and the open-source community developers producing exceptional distilled datasets.

Author: Jackrong

Likes: 10

Downloads: 0

Tags: gguf, qwen3_5, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, lora, image-text-to-text, en, zh, ko, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:Roman1111111/claude-opus-4.6-10000x, base_model:Qwen/Qwen3.5-27B, base_model:adapter:Qwen/Qwen3.5-27B, license:apache-2.0, endpoints_compatible, region:us, conversational

rednote-hilab/dots.mocr-svg


license: mit

library_name: dots_mocr

pipeline_tag: image-text-to-text

tags:

  • image-to-text
  • ocr
  • document-parse
  • layout
  • table
  • formula
  • transformers
  • custom_code

language:

  • en
  • zh
  • multilingual

<div align="center"> <h1 align="center"> dots.mocr </h1>

HuggingFace GitHub arXiv

<div align="center"> <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> | <a href="https://x.com/rednotehilab" target="_blank" rel="noopener noreferrer"><strong>🐦 X</strong></a> </div> </div>

Introduction

We present dots.mocr. Beyond achieving state-of-the-art (SOTA) performance in standard multilingual document parsing among models of comparable size, dots.mocr excels at converting structured graphics (e.g., charts, UI layouts, scientific figures) directly into SVG code. Its core capabilities encompass grounding, recognition, semantic understanding, and interactive dialogue.

Simultaneously, we are releasing dots.mocr-svg, a variant specifically optimized for robust image-to-SVG parsing tasks.

More information can be found in the paper.

Evaluation

1. Document Parsing

1.1 Elo scores of recent models across benchmarks

<table> <thead> <tr> <th>models</th> <th>olmOCR-Bench</th> <th>OmniDocBench (v1.5)</th> <th>XDocParse</th> <th>Average</th> </tr> </thead> <tbody> <tr> <td>MonkeyOCR-pro-3B</td> <td>895.0</td> <td>811.3</td> <td>637.1</td> <td>781.1</td> </tr> <tr> <td>GLM-OCR</td> <td>884.2</td> <td>972.6</td> <td>820.7</td> <td>892.5</td> </tr> <tr> <td>PaddleOCR-VL-1.5</td> <td>897.3</td> <td>997.9</td> <td>866.4</td> <td>920.5</td> </tr> <tr> <td>HuanyuanOCR</td> <td>997.6</td> <td>1003.9</td> <td>951.1</td> <td>984.2</td> </tr> <tr> <td>dots.ocr</td> <td>1041.1</td> <td>1027.2</td> <td>1190.3</td> <td>1086.2</td> </tr> <!-- Highlighting dots.mocr row with bold tags --> <tr> <td><strong>dots.mocr</strong></td> <td><strong>1104.4</strong></td> <td><strong>1059.0</strong></td> <td><strong>1210.7</strong></td> <td><strong>1124.7</strong></td> </tr> <tr> <td>Gemini 3 Pro</td> <td>1180.4</td> <td>1128.0</td> <td>1323.7</td> <td>1210.7</td> </tr> </tbody> </table>

Notes:

  • Results for Gemini 3 Pro, PaddleOCR-VL-1.5, and GLM-OCR were obtained via APIs, while HuanyuanOCR results were generated using local inference.
  • The Elo score evaluation was conducted using Gemini 3 Flash. The prompt can be found at: Elo Score Prompt. These results are consistent with the findings on ocrarena.
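For context, an Elo score evaluation of this kind aggregates pairwise comparisons (here judged by Gemini 3 Flash) with the standard Elo update rule. A minimal sketch; the K-factor and initial ratings below are illustrative assumptions, not the values used by the authors:

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo rating update for one pairwise comparison."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # expected score of A
    s_a = 1.0 if a_wins else 0.0                    # actual score of A
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

print(elo_update(1000, 1000, True))  # (1016.0, 984.0)
```

Iterating this update over many judged head-to-head parses yields per-model ratings like those in the table above.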

1.2 olmOCR-bench

<table> <thead> <tr> <th>Model</th> <th>ArXiv</th> <th>Old scans math</th> <th>Tables</th> <th>Old scans</th> <th>Headers & footers</th> <th>Multi column</th> <th>Long tiny text</th> <th>Base</th> <th>Overall</th> </tr> </thead> <tbody> <tr> <td>Mistral OCR API</td> <td>77.2</td> <td>67.5</td> <td>60.6</td> <td>29.3</td> <td>93.6</td> <td>71.3</td> <td>77.1</td> <td>99.4</td> <td>72.0±1.1</td> </tr> <tr> <td>Marker 1.10.1</td> <td>83.8</td> <td>66.8</td> <td>72.9</td> <td>33.5</td> <td>86.6</td> <td>80.0</td> <td>85.7</td> <td>99.3</td> <td>76.1±1.1</td> </tr> <tr> <td>MinerU 2.5.4*</td> <td>76.6</td> <td>54.6</td> <td>84.9</td> <td>33.7</td> <td>96.6</td> <td>78.2</td> <td>83.5</td> <td>93.7</td> <td>75.2±1.1</td> </tr> <tr> <td>DeepSeek-OCR</td> <td>77.2</td> <td>73.6</td> <td>80.2</td> <td>33.3</td> <td>96.1</td> <td>66.4</td> <td>79.4</td> <td>99.8</td> <td>75.7±1.0</td> </tr> <tr> <td>Nanonets-OCR2-3B</td> <td>75.4</td> <td>46.1</td> <td>86.8</td> <td>40.9</td> <td>32.1</td> <td>81.9</td> <td>93.0</td> <td>99.6</td> <td>69.5±1.1</td> </tr> <tr> <td>PaddleOCR-VL*</td> <td>85.7</td> <td>71.0</td> <td>84.1</td> <td>37.8</td> <td>97.0</td> <td>79.9</td> <td>85.7</td> <td>98.5</td> <td>80.0±1.0</td> </tr> <tr> <td>Infinity-Parser 7B*</td> <td>84.4</td> <td>83.8</td> <td>85.0</td> <td>47.9</td> <td>88.7</td> <td>84.2</td> <td>86.4</td> <td>99.8</td> <td>82.5±?</td> </tr> <tr> <td>olmOCR v0.4.0</td> <td>83.0</td> <td>82.3</td> <td>84.9</td> <td>47.7</td> <td>96.1</td> <td>83.7</td> <td>81.9</td> <td>99.7</td> <td>82.4±1.1</td> </tr> <tr> <td>Chandra OCR 0.1.0*</td> <td>82.2</td> <td>80.3</td> <td>88.0</td> <td>50.4</td> <td>90.8</td> <td>81.2</td> <td>92.3</td> <td>99.9</td> <td>83.1±0.9</td> </tr> <tr> <td>dots.ocr</td> <td>82.1</td> <td>64.2</td> <td>88.3</td> <td>40.9</td> <td>94.1</td> <td>82.4</td> <td>81.2</td> <td>99.5</td> <td>79.1±1.0</td> </tr> <tr> <td><strong>dots.mocr</strong></td> <td><strong>85.9</strong></td> <td><strong>85.5</strong></td> 
<td><strong>90.7</strong></td> <td>48.2</td> <td>94.0</td> <td><strong>85.3</strong></td> <td>81.6</td> <td>99.7</td> <td><strong>83.9±0.9</strong></td> </tr> </tbody> </table>

Note:

  • The metrics are from olmOCR and from our own internal evaluations.
  • We remove the Page-header and Page-footer cells from the result markdown.

1.3 Other Benchmarks

<table> <thead> <tr> <th>Model Type</th> <th>Methods</th> <th>Size</th> <th>OmniDocBench(v1.5)<br>TextEdit↓</th> <th>OmniDocBench(v1.5)<br>Read OrderEdit↓</th> <th>pdf-parse-bench</th> </tr> </thead> <tbody> <!-- GeneralVLMs Group (Reversed Order, 3 rows) --> <tr> <td rowspan="3"><strong>GeneralVLMs</strong></td> <td>Gemini-2.5 Pro</td> <td>-</td> <td>0.075</td> <td>0.097</td> <td>9.06</td> </tr> <tr> <td>Qwen3-VL-235B-A22B-Instruct</td> <td>235B</td> <td>0.069</td> <td>0.068</td> <td><strong>9.71</strong></td> </tr> <tr> <td>gemini3pro</td> <td>-</td> <td>0.066</td> <td>0.079</td> <td>9.68</td> </tr> <!-- SpecializedVLMs Group (Reversed Order, 12 rows) --> <tr> <td rowspan="12"><strong>SpecializedVLMs</strong></td> <td>Mistral OCR</td> <td>-</td> <td>0.164</td> <td>0.144</td> <td>8.84</td> </tr> <tr> <td>Deepseek-OCR</td> <td>3B</td> <td>0.073</td> <td>0.086</td> <td>8.26</td> </tr> <tr> <td>MonkeyOCR-3B</td> <td>3B</td> <td>0.075</td> <td>0.129</td> <td>9.27</td> </tr> <tr> <td>OCRVerse</td> <td>4B</td> <td>0.058</td> <td>0.071</td> <td>--</td> </tr> <tr> <td>MonkeyOCR-pro-3B</td> <td>3B</td> <td>0.075</td> <td>0.128</td> <td>-</td> </tr> <tr> <td>MinerU2.5</td> <td>1.2B</td> <td>0.047</td> <td>0.044</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL</td> <td>0.9B</td> <td>0.035</td> <td>0.043</td> <td>9.51</td> </tr> <tr> <td>HunyuanOCR</td> <td>0.9B</td> <td>0.042</td> <td>-</td> <td>-</td> </tr> <tr> <td>PaddleOCR-VL1.5</td> <td>0.9B</td> <td>0.035</td> <td>0.042</td> <td>-</td> </tr> <tr> <td>GLMOCR</td> <td>0.9B</td> <td>0.04</td> <td>0.043</td> <td>-</td> </tr> <tr> <td>dots.ocr</td> <td>3B</td> <td>0.048</td> <td>0.053</td> <td>9.29</td> </tr> <tr> <td><u><strong>dots.mocr</strong></u></td> <td>3B</td> <td><strong>0.031</strong></td> <td><strong>0.029</strong></td> <td>9.54</td> </tr> </tbody> </table>

Note:

  • Metrics are sourced from OmniDocBench and other model publications. pdf-parse-bench results were reproduced with Qwen3-VL-235B-A22B-Instruct.
  • Formula and Table metrics for OmniDocBench1.5 are omitted due to their high sensitivity to detection and matching protocols.
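The TextEdit metric above is a normalized edit distance (lower is better). A minimal sketch, assuming plain Levenshtein distance divided by the longer string's length; OmniDocBench's exact normalization and text preprocessing may differ:

```python
def normalized_edit_distance(pred, ref):
    """Levenshtein distance between pred and ref divided by the longer
    length, so 0.0 means an exact match. Single-row DP, O(len(ref)) memory."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                              # deletion
                      dp[j - 1] + 1,                          # insertion
                      prev + (pred[i - 1] != ref[j - 1]))     # substitution
            prev, dp[j] = dp[j], cur
    return dp[n] / max(m, n, 1)

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

A score like dots.mocr's 0.031 thus means roughly 3 character-level edits per 100 characters of reference text.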

2. Structured Graphics Parsing

Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. dots.mocr unifies the interpretation of these elements by parsing them directly into SVG code.

<table> <thead> <tr> <th rowspan="2" style="text-align: left;">Methods</th> <th colspan="3">Unisvg</th> <th rowspan="2">Chartmimic</th> <th rowspan="2">Design2Code</th> <th rowspan="2">Genexam</th> <th rowspan="2">SciGen</th> <th rowspan="2">ChemDraw</th> </tr> <tr> <th>Low-Level</th> <th>High-Level</th> <th>Score</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">OCRVerse</td> <td>0.632</td> <td>0.852</td> <td>0.763</td> <td>0.799</td> <td>-</td> <td>-</td> <td>-</td> <td>0.881</td> </tr> <tr> <td style="text-align: left;">Gemini 3 Pro</td> <td>0.563</td> <td>0.850</td> <td>0.735</td> <td>0.788</td> <td>0.760</td> <td>0.756</td> <td>0.783</td> <td>0.839</td> </tr> <tr> <td style="text-align: left;">dots.mocr</td> <td>0.850</td> <td>0.923</td> <td>0.894</td> <td>0.772</td> <td>0.801</td> <td>0.664</td> <td>0.660</td> <td>0.790</td> </tr> <tr> <td style="text-align: left;"><strong>dots.mocr-svg</strong></td> <td><strong>0.860</strong></td> <td><strong>0.931</strong></td> <td><strong>0.902</strong></td> <td><strong>0.905</strong></td> <td><strong>0.834</strong></td> <td><strong>0.8</strong></td> <td><strong>0.797</strong></td> <td><strong>0.901</strong></td> </tr> </tbody> </table>

Note:

  • We use the ISVGEN metric from UniSVG to evaluate the parsing result. For benchmarks that do not natively support image parsing, we use the original images as input, and calculate the ISVGEN score between the rendered output and the original image.
  • OCRVerse results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.mocr are based specifically on SVG code.
  • Due to the capacity constraints of a 3B-parameter VLM, dots.mocr may not yet excel at every task, such as SVG generation. To complement it, we are simultaneously releasing dots.mocr-svg. We plan to further address these limitations in future updates.

3. General Vision Tasks

<table> <thead> <tr> <th>Model</th> <th>CharXiv_descriptive</th> <th>CharXiv_reasoning</th> <th>OCR_Reasoning</th> <th>infovqa</th> <th>docvqa</th> <th>ChartQA</th> <th>OCRBench</th> <th>AI2D</th> <th>CountBenchQA</th> <th>refcoco</th> </tr> </thead> <tbody> <tr> <td>Qwen3vl-2b-instruct</td> <td>62.3</td> <td>26.8</td> <td>-</td> <td>72.4</td> <td>93.3</td> <td>-</td> <td>85.8</td> <td>76.9</td> <td>88.4</td> <td>-</td> </tr> <tr> <td>Qwen3vl-4b-instruct</td> <td>76.2</td> <td>39.7</td> <td>-</td> <td>80.3</td> <td>95.3</td> <td>-</td> <td>88.1</td> <td>84.1</td> <td>84.9</td> <td>-</td> </tr> <tr> <td><strong>dots.mocr</strong></td> <td>77.4</td> <td>55.3</td> <td>22.85</td> <td>73.76</td> <td>91.85</td> <td>83.2</td> <td>86.0</td> <td>82.16</td> <td>94.46</td> <td>80.03</td> </tr> </tbody> </table>

Quick Start

1. Installation

Install dots.mocr

conda create -n dots_mocr python=3.12
conda activate dots_mocr

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
# pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# install flash-attn==2.8.0.post2 for faster inference
pip install -e .

If you have trouble with the installation, try our Docker Image for an easier setup, then follow these steps:

Download Model Weights

💡Note: Please use a directory name without periods (e.g., DotsMOCR instead of dots.mocr) for the model save path. This is a temporary workaround pending our integration with Transformers.

python3 tools/download_model.py

# with modelscope
python3 tools/download_model.py --type modelscope

2. Deployment

vLLM inference

We highly recommend using vLLM for deployment and inference. Since vLLM version 0.11.0, Dots OCR has been officially integrated into vLLM with verified performance, so you can use the vLLM Docker image directly (e.g., vllm/vllm-openai:v0.11.0) to deploy the model server.

# Launch vLLM model server
## dots.mocr
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.mocr --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

## dots.mocr-svg
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.mocr-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

# vLLM API Demo
# See dots_mocr/model/inference.py and dots_mocr/utils/prompts.py for details on parameter and prompt settings 
# that help achieve the best output quality.
## document parsing
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en 
## web parsing 
python3 ./demo/demo_vllm.py --prompt_mode prompt_web_parsing --image_path ./assets/showcase/origin/webpage_1.png
## scene spotting
python3 ./demo/demo_vllm.py --prompt_mode prompt_scene_spotting --image_path ./assets/showcase/origin/scene_1.jpg
## image parsing with svg code
python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg 
## general qa
python3 ./demo/demo_vllm_general.py
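Since the vLLM server launched above exposes an OpenAI-compatible API (served model name "model"), you can also query it directly. A minimal sketch of building the request payload, using the standard chat-completions data-URL image format; POST it as JSON to http://localhost:8000/v1/chat/completions (the demo path below is from the repo; everything else is a plain stdlib sketch, not the project's own client):

```python
import base64

def build_ocr_request(image_path, prompt, model="model"):
    """Build an OpenAI-compatible chat payload with an inline base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# e.g. payload = build_ocr_request("demo/demo_image1.jpg", "Parse the layout.")
```

For best output quality, prefer the provided demo scripts, which set the prompts and sampling parameters from dots_mocr/utils/prompts.py.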

Hugging Face inference

python3 demo/demo_hf.py
<details> <summary><b>Hugginface inference details</b></summary>
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
from dots_mocr.utils import dict_promptmode_to_prompt

model_path = "./weights/DotsMOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path
                },
                {"type": "text", "text": prompt}
            ]
        }
    ]

# Preparation for inference
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

</details>

Hugging Face inference with CPU

Please refer to CPU inference

3. Document Parse

With the vLLM server running, you can parse an image or a PDF file using the following commands:


# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_mocr/parser.py demo/demo_image1.jpg
# Parse a single PDF
python3 dots_mocr/parser.py demo/demo_pdf1.pdf  --num_thread 64  # increase --num_thread for PDFs with many pages

# Layout detection only
python3 dots_mocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_mocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr


With Transformers, you can parse an image or a PDF file using the same commands as above; just add --use_hf true.

Notice: Transformers inference is slower than vLLM. To run the demo/* scripts with Transformers, pass use_hf=True to DotsMOCRParser, i.e., DotsMOCRParser(..., use_hf=True).

<details> <summary><b>Output Results</b></summary>
  1. Structured Layout Data (demo_image1.json): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
  2. Processed Markdown File (demo_image1.md): A Markdown file generated from the concatenated text of all detected cells.
    • An additional version, demo_image1_nohf.md, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench.
  3. Layout Visualization (demo_image1.jpg): The original image with the detected layout bounding boxes drawn on it.
</details>
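The prompt's reading-order constraint (elements sorted as a human would read them) can be illustrated with a post-hoc sort over the layout JSON: top-to-bottom, then left-to-right. The line_tol banding heuristic and sample cells below are illustrative assumptions, not the model's actual ordering algorithm:

```python
def reading_order(elements, line_tol=10):
    """Sort layout elements top-to-bottom, then left-to-right; top edges
    within the same line_tol-pixel band count as one visual line."""
    def key(el):
        x1, y1, x2, y2 = el["bbox"]
        return (int(y1 // line_tol), x1)
    return sorted(elements, key=key)

cells = [
    {"bbox": [300, 12, 480, 40], "category": "Text"},   # right column, top line
    {"bbox": [20, 10, 280, 40], "category": "Title"},   # left column, top line
    {"bbox": [20, 200, 480, 260], "category": "Text"},  # lower block
]
print([c["category"] for c in reading_order(cells)])  # ['Title', 'Text', 'Text']
```

Real multi-column pages need a smarter column-aware ordering; this sketch only shows what the constraint means for the emitted JSON.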

4. Demo

Have fun with the live demo.

Examples for document parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/formula1.png" alt="formula1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/table3.png" alt="table3.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/Tibetan.png" alt="Tibetan.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/tradition_zh.png" alt="tradition_zh.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/nl.png" alt="nl.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/kannada.png" alt="kannada.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/russian.png" alt="russian.png" border="0" />

Examples for image parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_1.png" alt="svg_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_2.png" alt="svg_2.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_4.png" alt="svg_4.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_5.png" alt="svg_5.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/svg_6.png" alt="svg_6.png" border="0" />

Note:

  • Results generated by dots.mocr-svg

Example for web parsing

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/webpage_1.png" alt="webpage_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/webpage_2.png" alt="webpage_2.png" border="0" />

Examples for scene spotting

<img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/scene_1.png" alt="scene_1.png" border="0" /> <img src="https://raw.githubusercontent.com/rednote-hilab/dots.mocr/master/assets/showcase/result/scene_2.png" alt="scene_2.png" border="0" />

Limitations & Future Work

  • Complex Document Elements:

    • Table & Formula: Extracting complex tables and mathematical formulas remains a difficult task given the model's compact architecture.
    • Picture: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
  • Parsing Failures: While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.

Citation

@misc{zheng2026multimodalocrparsedocuments,
      title={Multimodal OCR: Parse Anything from Documents}, 
      author={Handong Zheng and Yumeng Li and Kaile Zhang and Liang Xin and Guangwei Zhao and Hao Liu and Jiayu Chen and Jie Lou and Jiyu Qiu and Qi Fu and Rui Yang and Shuo Jiang and Weijian Luo and Weijie Su and Weijun Zhang and Xingyu Zhu and Yabin Li and Yiwei Ma and Yu Chen and Zhaohui Yu and Guang Yang and Colin Zhang and Lei Zhang and Yuliang Liu and Xiang Bai},
      year={2026},
      eprint={2603.13032},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13032}, 
}

Author: rednote-hilab

Likes: 5

Downloads: 0

Tags: dots_mocr, safetensors, dots_ocr, text-generation, image-to-text, ocr, document-parse, layout, table, formula, transformers, custom_code, image-text-to-text, conversational, en, zh, multilingual, arxiv:2603.13032, license:mit, region:us

Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2


language:

  • en
  • zh
  • ko

license: apache-2.0
base_model: Qwen/Qwen3.5-27B
pipeline_tag: image-text-to-text

tags:

  • unsloth
  • qwen
  • qwen3.5
  • reasoning
  • chain-of-thought
  • lora

datasets:

  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Jackrong/Qwen3.5-reasoning-700x
  • Roman1111111/claude-opus-4.6-10000x

🌟 Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

📢 Announcement

v2 Update: This iteration is powered by 14,000+ premium Claude 4.6 Opus-style general reasoning samples, with a major focus on achieving massive gains in reasoning efficiency while actively improving peak accuracy.

v2 introduces a refined reasoning scaffold designed to eliminate redundant internal loops, significantly improving the model's cross-task generalization from logic and math into specialized fields like programming. Compared to the original model, autonomy and stability are significantly improved, ensuring the model remains robust and self-consistent during complex, multi-step problem solving. v2 is built to think smarter, not longer, delivering substantial improvements in inference speed and cost-effectiveness while simultaneously boosting baseline accuracy.

Note: Due to the constraints of SFT sample size and training scope, the model's broad general-purpose capabilities might be slightly impacted. The efficiency and accuracy results discussed here are based on the HumanEval and HumanEval+ benchmarks. Thank you for your understanding!


💡 Model Introduction

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 is the second iteration of this reasoning-focused Qwen3.5-27B fine-tune, built to make chain-of-thought generation substantially more efficient, improving reasoning speed and cost-effectiveness while increasing absolute accuracy.

Compared with the earlier version, v2 was trained with 14,000 Claude 4.6 Opus-style general reasoning samples, with a stronger emphasis on transferring concise, reusable reasoning patterns rather than only maximizing raw benchmark scores. The goal of v2 is not simply to make the model "think more," but to help it think more economically: reducing unnecessarily long internal chains, avoiding verbose over-analysis on easy problems, and massively improving the reasoning-cost-to-quality ratio while beating the baseline's benchmark correctness.

A key design choice in v2 is that the distillation data is primarily general-domain reasoning data—specifically focused on mathematics, word problems, logical deduction, and a balanced mix of general knowledge and instructions—rather than specialized code-heavy supervision. Consequently, HumanEval and HumanEval+ are employed here to evaluate cross-task generalization and capability transfer, rather than serving as direct optimization targets. High performance on these benchmarks, despite the lack of code-centric training, confirms that the model's reasoning scaffold has become more robust and transferable, proving that fundamental reasoning logic can effectively power specialized tasks like programming.

Why v2 matters

  • ⚠️The model's scores are still under evaluation...

🗺️ Training Pipeline Overview

Base Model (Qwen3.5-27B)
 │
 ▼
Qwen3.5-27B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
 │
 ▼
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
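
The response-only masking step in the pipeline above can be sketched as label masking before the assistant marker (an illustrative reconstruction, not the actual training code; `mask_prompt_labels` and the toy token IDs are hypothetical):

```python
# Response-only SFT label masking sketch: every token before the start of
# the assistant response gets label -100, which cross-entropy loss ignores,
# so the loss covers only the model's response tokens.
def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    return [-100] * response_start + input_ids[response_start:]

# Toy example: first three tokens are the prompt, the rest is the response.
labels = mask_prompt_labels([101, 102, 103, 104, 105], response_start=3)
```

In practice `response_start` would be located by searching the tokenized sequence for the `<|im_start|>assistant\n<think>` marker mentioned above.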

🧠 Example of Learned Reasoning Scaffold

The model includes targeted optimizations addressing Qwen3.5’s tendency toward excessive transitional or repetitive reasoning on simple queries. Through deep distillation and structural imitation of Claude-4.6-Opus reasoning chains, the model adopts a more efficient structured thinking pattern:
“Let me analyze this request carefully: 1..2..3...”.
This streamlined reasoning paradigm significantly reduces redundant cognitive loops while preserving deep analytical capacity, resulting in substantially improved inference efficiency.

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
            .
            .
            .

📚 All Datasets Used

The dataset consists of high-quality, filtered reasoning distillation data:

| Dataset Name | Description / Purpose |
|--------------|-----------------------|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | Comprehensive Claude 4.6 Opus reasoning trajectories. |
| Roman1111111/claude-opus-4.6-10000x | Large-scale public Claude 4.6 Opus distillation data used to strengthen general reasoning transfer in v2. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | Injects high-intensity, structured reasoning instances. |
| Jackrong/Qwen3.5-reasoning-700x | Additional curated reasoning samples that strengthen structured step-by-step problem solving and improve reasoning diversity. |

⚠️ Limitations & Intended Use

  • Hallucination Risk: While its reasoning is strong, the model remains an autoregressive LLM; facts stated during the thinking sequence may occasionally be hallucinated, so verify any claims about real-world events.
  • Intended Scenario: Best suited for offline analytical tasks, coding, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic.
  • This model is a test version intended solely for learning and demonstration purposes, and is for academic research and technical exploration use only.

🙏 Acknowledgements

Significant thanks to the Unsloth AI team for making rapid fine-tuning of large LLMs accessible. We also thank the Qwen team and the open-source community developers producing exceptional distilled datasets.

Author: Jackrong

Likes: 5

Downloads: 0

Tags: safetensors, qwen3_5, unsloth, qwen, qwen3.5, reasoning, chain-of-thought, lora, image-text-to-text, conversational, en, zh, ko, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:Jackrong/Qwen3.5-reasoning-700x, dataset:Roman1111111/claude-opus-4.6-10000x, base_model:Qwen/Qwen3.5-27B, base_model:adapter:Qwen/Qwen3.5-27B, license:apache-2.0, region:us

EdgeTypE/Gokturk_OCR_Test


language:

  • tr
  • otk

tags:

  • ocr
  • old-turkic
  • gokturk
  • resnet
  • onnx
  • computer-vision
  • image-classification

metrics:

  • accuracy

Work in progress.

Gokturk ResNet OCR Model

This is a ResNet-based OCR model specifically trained to recognize Old Turkic (Gokturk) script characters. It is optimized for inference using ONNX Runtime, making it highly portable and efficient.

Model Description

  • Repository: OldTurkicOCR
  • Task: Optical Character Recognition (OCR) / Image Classification
  • Classes: 75 characters (Unicode range U+10C00–U+10C4A)
  • Input Shape: (Batch, 64, 64, 1) (Grayscale)
  • Format: ONNX

How to use

This model is primarily used within the OldTurkicOCR Rust project.

In Rust (with ort crate)

use ort::session::Session;

// Build an ONNX Runtime session from the exported model file.
let session = Session::builder()?
    .commit_from_file("gokturk_resnet_v1.onnx")?;

In Python (with onnxruntime)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("gokturk_resnet_v1.onnx")

# The model expects a 64x64 grayscale image normalized to [0, 1],
# shaped (batch, 64, 64, 1).
input_data = np.zeros((1, 64, 64, 1), dtype=np.float32)  # replace with a real image
result = session.run(None, {"input_1": input_data})
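
Since the model is a 75-way classifier over the Unicode range given above, a decoding step can be sketched as follows. This assumes class index i maps linearly to codepoint U+10C00 + i, which the card does not state explicitly; `class_to_char` is a hypothetical helper:

```python
# Hypothetical decode step: map the argmax class index to an Old Turkic
# character, assuming 75 classes span U+10C00..U+10C4A in order.
def class_to_char(class_index: int) -> str:
    if not 0 <= class_index <= 0x4A:
        raise ValueError("class index out of range for the 75-class model")
    return chr(0x10C00 + class_index)
```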

Dataset and Training

The model was trained on a curated dataset of Gokturk script characters, covering various styles and weights of the Orkhon and Yenisei variants.

Files

Author: EdgeTypE

Likes: 2

Downloads: 0

Tags: onnx, ocr, old-turkic, gokturk, resnet, computer-vision, image-classification, tr, otk, region:us

dataslab/DLM-NL2JSON-4B


language:

  • ko

license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets: []
pipeline_tag: text-generation

tags:

  • task-specific
  • structured-prediction
  • korean
  • public-sector
  • qwen3
  • domain-specific
  • merge

model-index:

  • name: DLM-NL2JSON-4B
    results:
    • task:
        type: structured-prediction
        name: Korean NL-to-JSON Schema Extraction
      dataset:
        type: custom
        name: Busan Public Data Query Test Set
        args:
          num_samples: 2041
      metrics:
      • type: exact_match
        value: 94.4
        name: Exact Match Accuracy (raw)
      • type: exact_match
        value: 96.8
        name: Exact Match Accuracy (adjusted)

DLM-NL2JSON-4B

A 4B-parameter service-specific LLM that outperforms GPT-4o (+14%p) and Qwen3.5-35B (+22%p) on structured JSON extraction from Korean natural language queries.

DLM (Domain-specific Language Model) is a series of task-specialized models by Data Science Lab., Ltd. This model is a LoRA-merged Qwen3-4B fine-tuned for structured JSON extraction in the Busan Metropolitan City public data analytics service.

Key Results

Evaluated on 2,041 test samples across 10 task categories (field-level exact match, summary excluded):

| Model | Params | Accuracy | Accuracy (adj*) | Avg Latency |
|-------|--------|----------|-----------------|-------------|
| DLM-NL2JSON-4B | 4B | 94.4% | 96.8% | 2.59s |
| GPT-4o | ~200B+ | 80.5% | 82.5% | 1.58s |
| Qwen3.5-35B-A3B | 35B | 72.2% | 73.9% | 0.85s |

*adj: 64 CSM samples with known gold label noise excluded (see Evaluation section)

Per-Category Breakdown

| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B |
|----------|---|----------------|--------|-------------|
| ALP-A (population pattern) | 250 | 99.6% | 56.0% | 47.6% |
| ALP-B (population flow) | 250 | 98.4% | 50.4% | 46.8% |
| CSM (consumer spending) | 700 | 90.6% | 90.1% | 86.1% |
| CREDIT-Income | 58 | 94.8% | 53.4% | 34.5% |
| CREDIT-Spending | 77 | 97.4% | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | 98.6% | 94.5% | 72.6% |
| CPI (business status) | 219 | 86.3% | 87.2% | 54.8% |
| GIS-Inflow | 72 | 97.2% | 79.2% | 93.1% |
| GIS-Outflow | 62 | 98.4% | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | 99.6% | 97.5% |

DLM-NL2JSON-4B wins 8 out of 10 categories, with the largest gains on ALP (+43%p vs GPT-4o) and CREDIT-Income (+41%p).

Important: This is a Service-Specific Model

This model is NOT a general-purpose NL-to-JSON converter. It is trained exclusively for a fixed set of predefined schemas used in a specific production service. It will not generalize to arbitrary JSON schemas or different prompt formats.

To use this model correctly, you must:

  1. Use the exact system prompts it was trained on (one per task category — see Usage section)
  2. Include the corresponding special token (<TASK_CSM>, <TASK_CREDIT>, <TASK_GIS>, <TASK_ALP>, <TASK_CPI>) in the input
  3. Expect output conforming only to the predefined schemas listed below

Why publish a service-specific model? This model serves as a reference implementation demonstrating that task-specific LoRA fine-tuning on a 4B model can dramatically outperform GPT-4o and larger open-source models on constrained structured output tasks. We believe the DLM (Domain-specific Language Model) approach — training small, cheap-to-serve models for specific service endpoints — is an underexplored but highly practical paradigm.

Intended Use

This model converts Korean natural language queries about public/economic data into structured JSON conforming to its predefined schemas. It is designed for and deployed in the Busan Metropolitan City Big Data Wave analytics dashboard.

Input: Free-form Korean query + task-specific system prompt

Output: Single-line JSON with exact schema compliance:

{"summary":"##2025년 5월 부산광역시 해운대구 유통/의료 소비분석##","base_ym":202505,"region_nm":"부산광역시 해운대구","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}

Task Categories

| ID | Name | Schema Type |
|----|------|-------------|
| 0 | ALP-A | Population pattern (ptrn: residence/work/visit) |
| 1 | ALP-B | Population flow (flow_cd: inflow/outflow) |
| 2 | CSM | Consumer spending by industry |
| 3 | CREDIT-Income | Income statistics |
| 4 | CREDIT-Spending | Spending statistics |
| 5 | CREDIT-Loan | Loan/default statistics |
| 6 | CPI | Business/enterprise status |
| 9 | GIS-Inflow | Geographic inflow analysis |
| 10 | GIS-Outflow | Geographic outflow analysis |
| 11 | GIS-Consumption | Geographic consumption analysis |

Training Details

| Item | Value |
|------|-------|
| Base model | Qwen/Qwen3-4B |
| Method | LoRA SFT → merged full model |
| Training samples | 16,292 (Korean) |
| Validation samples | 2,034 |
| Special tokens | <TASK_CSM>, <TASK_CREDIT>, <TASK_GIS>, <TASK_ALP>, <TASK_CPI> |
| Max sequence length | 6,144 |
| Architecture | Qwen3ForCausalLM (36 layers, 2560 hidden, 32 heads) |

Training data consists of synthetically generated Korean natural language queries paired with structured JSON outputs, covering the Busan public data analytics domain.

Evaluation Methodology

  • Metric: Field-level exact match — each JSON key's value is compared against the gold label. The summary field is excluded from comparison.
  • Test set: 2,041 samples, stratified by category
  • Gold label noise: 64/700 CSM samples have age_cd capped at [10..60] instead of [10..70] for "all ages" queries, conflicting with the prompt specification. These affect all models equally and are excluded in the adjusted metric.
  • Train/Test overlap: 16/2,041 input strings (0.78%) appear in both sets — retained for consistency.
  • All models received identical system prompts per category.
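
The field-level metric described above can be sketched as follows (a minimal reconstruction under the stated rules, not the authors' evaluation code; `field_exact_match` is a hypothetical helper):

```python
import json

# Field-level exact match: every key except "summary" must equal the gold
# label exactly; any mismatch fails the sample.
def field_exact_match(pred_str: str, gold_str: str) -> bool:
    pred, gold = json.loads(pred_str), json.loads(gold_str)
    keys = (set(pred) | set(gold)) - {"summary"}  # summary is excluded
    return all(pred.get(k) == gold.get(k) for k in keys)

# Summary differences are ignored; all other fields must match.
field_exact_match(
    '{"summary":"##A##","base_ym":202401,"sex_cd":[0]}',
    '{"summary":"##B##","base_ym":202401,"sex_cd":[0]}',
)
```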

Hardware

| Model | Serving | GPU |
|-------|---------|-----|
| DLM-NL2JSON-4B | TensorRT-LLM | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "dataslab/DLM-NL2JSON-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# System prompt (example: CSM consumer spending schema — abbreviated for readability)
# Full prompts per category are available in the repository's eval/prompts.py
system_prompt = """너는 반드시 **JSON 한 줄**만 출력한다. 설명/텍스트/코멘트/마크다운/코드블록/이모지/공백 줄 금지.
출력은 항상 { 로 시작하고 } 로 끝난다.

[스키마: TASK_CSM] (키/타입/순서 엄수)
{"summary":string,"base_ym":int,"region_nm":string,"industry_select":object,"sex_cd":[int],"age_cd":[int],"category":2}

[기본값]
- base_ym: 0, region_nm: "부산광역시"
- industry_select: 업종 미지정 시 전 대분류 키를 []로 설정
- sex_cd: [0,1], age_cd: [10,20,30,40,50,60,70]
- category: 항상 2

[대분류 코드표] 1:여행/숙박 2:여가/문화 3:유통 4:음식/주점 5:음식료품
6:의류/잡화 7:미용 8:의료 9:교육 10:생활 11:자동차"""

# Note: special token <TASK_CSM> must be included in the user message
user_query = "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.0, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
# {"summary":"##2024년 1월 부산광역시 해운대구 중동 의류/잡화/미용 소비분석##","base_ym":202401,"region_nm":"부산광역시 해운대구 중동","industry_select":{"6":[],"7":[]},"sex_cd":[0],"age_cd":[20,30,40],"category":2}
# Note: "뷰티" → mapped to 미용(code 7), "해운대구 중동" → normalized to "부산광역시 해운대구 중동"

vLLM / OpenAI-compatible serving

from openai import OpenAI

client = OpenAI(base_url="http://your-server:8006/v1", api_key="token")
resp = client.chat.completions.create(
    model="DLM-NL2JSON-4B",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"}
    ],
    max_tokens=512,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}  # disable thinking mode
)
print(resp.choices[0].message.content)

Important: When serving with vLLM/TensorRT-LLM, pass chat_template_kwargs: {"enable_thinking": false} to disable the Qwen3 thinking mode. Otherwise, reasoning tokens will consume the output budget and truncate the JSON.

Known Limitations

  1. CPI category (86.3%) is the weakest — complex industry classification codes (A~U with sub-codes) are harder to extract.
  2. CSM training data noise: ~8% of CSM training samples have age_cd capped at 60 instead of 70 for "all ages" queries, introducing inconsistency.
  3. Domain-specific only: This model is trained exclusively for the Busan public data schema extraction task. It has no general-purpose capabilities and should not be used as a general chatbot.
  4. Korean only: All training data and prompts are in Korean.

Citation

If you use this model, please cite:

@misc{dsl-dlm-nl2json-4b,
  title={DLM-NL2JSON-4B: A Domain-Specific Language Model for Korean Public Data Schema Extraction},
  author={Data Science Lab., Ltd.},
  year={2026},
  url={https://huggingface.co/dataslab/DLM-NL2JSON-4B}
}

Contact

  • Organization: Data Science Lab., Ltd.
  • Project: Busan Metropolitan City Big Data Wave

Author: dataslab

Likes: 2

Downloads: 36

Tags: safetensors, qwen3, task-specific, structured-prediction, korean, public-sector, domain-specific, merge, text-generation, conversational, ko, base_model:Qwen/Qwen3-4B, base_model:finetune:Qwen/Qwen3-4B, license:apache-2.0, model-index, region:us

koifishlabs/Kurogo-4B-JP


language:

  • ja
  • en

license: apache-2.0
base_model: Qwen/Qwen3.5-4B
pipeline_tag: text-generation

tags:

  • qwen
  • gguf
  • japanese
  • finetuned

Kurogo-4B-JP (GGUF)

Finetuned Qwen 3.5 4B optimized for Japanese knowledge and conversational Q&A. Built for Kurogo, a fully offline AI assistant app.

Available Files

| File | Quantization | Size |
|------|--------------|------|
| Qwen3.5-4B-Kurogo-Q4_K_M.gguf | Q4_K_M | 2.5 GB |
| Qwen3.5-4B-Kurogo-Q3_K_M.gguf | Q3_K_M | 2.1 GB |

Usage

Works with llama.cpp, llama-cpp-python, and llama.rn (React Native):

llama-cli -m Qwen3.5-4B-Kurogo-Q3_K_M.gguf -p "<|im_start|>user\n日本の首都はどこですか?<|im_end|>\n<|im_start|>assistant\n" -ngl 99

Attribution

This model is a derivative work based on Qwen 3.5 4B by the Qwen Team, Alibaba Group. The original model weights were modified through LoRA finetuning. The original model is licensed under Apache 2.0.

License

Apache 2.0 — see LICENSE for details.

Author: koifishlabs

Likes: 1

Downloads: 0

Tags: gguf, qwen, japanese, finetuned, text-generation, ja, en, base_model:Qwen/Qwen3.5-4B, base_model:quantized:Qwen/Qwen3.5-4B, license:apache-2.0, endpoints_compatible, region:us, conversational

taetae77/smolvla_test_policy


base_model: lerobot/smolvla_base
datasets: taetae77/lerobot-test1
library_name: lerobot
license: apache-2.0
model_name: smolvla
pipeline_tag: robotics

tags:

  • robotics
  • lerobot
  • smolvla

Model Card for smolvla


SmolVLA is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.

This policy has been trained and pushed to the Hub using LeRobot. See the full documentation at LeRobot Docs.


How to Get Started with the Model

For a complete walkthrough, see the training guide. Below is a short version showing how to train and run inference/evaluation:

Train from scratch

lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --policy.type=smolvla \
  --output_dir=outputs/train/<desired_policy_repo_id> \
  --job_name=lerobot_training \
  --policy.device=cuda \
  --policy.repo_id=${HF_USER}/<desired_policy_repo_id> \
  --wandb.enable=true

Writes checkpoints to outputs/train/<desired_policy_repo_id>/checkpoints/.

Evaluate the policy/run inference

lerobot-record \
  --robot.type=so100_follower \
  --dataset.repo_id=<hf_user>/eval_<dataset> \
  --policy.path=<hf_user>/<desired_policy_repo_id> \
  --episodes=10

Prefix the dataset repo with eval_ and supply --policy.path pointing to a local or hub checkpoint.


Model Details

  • License: apache-2.0

Author: taetae77

Likes: 1

Downloads: 0

Tags: lerobot, safetensors, robotics, smolvla, dataset:taetae77/lerobot-test1, arxiv:2506.01844, base_model:lerobot/smolvla_base, base_model:finetune:lerobot/smolvla_base, license:apache-2.0, region:us

jeeasz/Shot-1.5B-Chat


license: apache-2.0

Author: jeeasz

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

wangzhang/LFM2-24B-A2B-abliterated


base_model: LiquidAI/LFM2-24B-A2B
license: apache-2.0

language:

  • en
  • zh

tags:

  • prometheus
  • uncensored
  • decensored
  • abliterated
  • liquid
  • moe

LFM2-24B-A2B-abliterated

Unrestricted version of LiquidAI/LFM2-24B-A2B, created using Prometheus.

This is the first abliterated model based on Liquid AI's hybrid gated short convolution + grouped query attention architecture with Mixture of Experts.

Model Details

| Property | Value |
|----------|-------|
| Base Model | LiquidAI/LFM2-24B-A2B |
| Architecture | Hybrid Conv + GQA with MoE (64 experts, top-4 routing) |
| Parameters | 24B total / 2.3B active per token |
| Layers | 40 (10 attention + 30 convolution) |
| Hidden Size | 2048 |
| Context Length | 128K tokens |
| Precision | BF16 |

Performance

| Metric | This model | Original |
|--------|------------|----------|
| KL divergence | 0.0079 | 0 |
| Refusals | 0/100 (0%) | 90/100 (90%) |

Evaluated with an LLM judge (Gemini Flash) on 100 harmful prompts. KL divergence of 0.0079 indicates the model's general capabilities are virtually identical to the original.
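
A KL-divergence check of this kind can be sketched as follows (an illustrative reconstruction, not the Prometheus evaluation harness; `kl_divergence` is a hypothetical helper): compare the next-token probability distributions of the two models on the same prompt, where 0 means identical behaviour.

```python
import numpy as np

# KL(p || q) over two next-token probability distributions; eps avoids
# log(0) for tokens one distribution assigns (near-)zero probability.
def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```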

How It Was Made

  1. Computed refusal directions from 400 harmful vs 400 benign prompt pairs across all 40 layers
  2. Applied orthogonalized abliteration to isolate refusal-specific activation patterns
  3. Steered three component types independently: convolution output projections, attention output projections, and MLP/expert down-projections
  4. Profiled MoE expert activations across 38 router layers to identify safety-critical experts
  5. Applied hybrid MoE steering: router weight suppression (25 experts, bias=-0.41) + fused expert abliteration (weight=2.79)
  6. Optimized via Optuna TPE (trial #10 of 50, with 15 warmup trials)

This is notable as the first successful abliteration of a non-transformer hybrid architecture — LFM2's gated short convolution blocks required novel steering targets beyond standard attention/MLP pairs.
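
The orthogonalized abliteration in step 2 can be sketched as directional ablation, a common formulation of the technique (this is an illustrative reconstruction, not the Prometheus implementation; `ablate_direction` is hypothetical): project the refusal direction out of a weight matrix so its outputs carry no component along that direction.

```python
import numpy as np

# Directional ablation sketch: remove the component of W's output along a
# unit refusal direction r, i.e. W' = (I - r r^T) W, so r^T (W' x) = 0
# for every input x.
def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W
```

In the card's pipeline this kind of projection is applied to convolution output projections, attention output projections, and expert down-projections, per step 3.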

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/LFM2-24B-A2B-abliterated",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/LFM2-24B-A2B-abliterated")

messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Hardware Requirements

| Precision | VRAM |
|-----------|------|
| BF16 | ~48 GB (A100 80GB, H100) |
| INT8 | ~24 GB (A40, RTX 4090) |
| NF4 | ~12 GB (RTX 3090, RTX 4080) |

Note: This model must run on a single GPU; the convolution layers do not support accelerate's multi-GPU device_map splitting.

Disclaimer

This model is intended for research purposes only. The removal of safety guardrails means the model will comply with requests that the original model would refuse. Users are responsible for ensuring their use complies with applicable laws and regulations.


Made with Prometheus

Author: wangzhang

Likes: 1

Downloads: 0

Tags: safetensors, lfm2_moe, prometheus, uncensored, decensored, abliterated, liquid, moe, en, zh, base_model:LiquidAI/LFM2-24B-A2B, base_model:finetune:LiquidAI/LFM2-24B-A2B, license:apache-2.0, region:us