Today's AI Summary

AI Developments: Code Generation, Video Understanding, and Reinforcement Learning Take Center Stage

Today's AI landscape is buzzing with activity across several key areas, including code generation, multimodal reasoning, and reinforcement learning. New models are pushing the boundaries of code completion and image generation, while research papers explore innovative approaches to video understanding, spatial reasoning, and performative policy optimization.

Research Highlights

  • LongVideoAgent: Multi-Agent Reasoning with Long Videos (Runtao Liu et al.): This paper introduces a multi-agent framework for long-video question answering. By coordinating a grounding agent for segment localization and a vision agent for textual observation extraction, the system achieves superior performance on episode-level datasets compared to non-agent baselines. The use of reinforcement learning further enhances reasoning and planning capabilities.
  • Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning (Seijin Kobayashi et al.): This research explores how to improve reinforcement learning by acting and exploring within the internal representations of an autoregressive model. By discovering temporally-abstract actions, the model learns to compress long activation sequence chunks onto internal controllers, leading to efficient exploration on novel tasks.
  • Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs (Dhruv Anand, Ehsan Shareghi): This paper introduces a new benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs) using Rubik's Cube challenges. The benchmark decomposes performance into five skills, revealing a performance gap between closed-source and open-weight models, and highlighting the challenges MLLMs face with increasing cube complexity.
  • Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs (Rui Pan, Zhuofu Chen, Ravi Netravali): This paper introduces FailFast, a dLLM-based speculative decoding framework that dynamically adapts its speculation length. It achieves up to 4.9x speedup over vanilla decoding, 1.7x over the best naive dLLM drafter, and 1.4x over EAGLE-3 across diverse models and workloads.
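FailFast's contribution is varying the speculation length; the accept/verify loop shared by all speculative decoders can be sketched with toy deterministic "models" (plain dicts standing in for the draft and target LLMs; a real implementation verifies all draft tokens in one batched target forward pass rather than one at a time):

```python
def greedy_next(model, prefix):
    # Toy "model": a dict mapping context tuples to a next token.
    # Stands in for an LLM forward pass + argmax.
    return model.get(tuple(prefix), 0)

def speculative_step(target, draft, prefix, k):
    """One speculative-decoding step: draft k tokens cheaply, then keep the
    longest prefix the target agrees with, plus one corrective target token,
    so every step makes at least one token of progress."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = greedy_next(draft, ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if greedy_next(target, ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break  # first disagreement: discard the rest of the draft
    accepted.append(greedy_next(target, ctx))  # corrective token from the target
    return accepted
```

When the draft agrees with the target on all k tokens, one step emits k+1 tokens; adapting k to the draft's recent accept rate is the knob FailFast tunes.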

Model Spotlight

  • Maincode/Maincoder-1B: This 1-billion-parameter model is designed for code generation and completion tasks, with a focus on Python. It utilizes a modern transformer architecture and is fine-tuned with a specialized reinforcement learning policy optimization algorithm (MCPO). Maincoder-1B achieves state-of-the-art performance on Python coding benchmarks like HumanEval, HumanEval+, and MBPP+.
  • zai-org/GLM-4.7-AWQ: This is a quantized version of the GLM-4.7 model, optimized for coding, tool use, and complex reasoning. GLM-4.7 introduces features like Interleaved Thinking, Preserved Thinking, and Turn-level Thinking to improve instruction following and generation quality. It demonstrates significant improvements on benchmarks such as SWE-bench, Terminal Bench 2.0, and HLE.
  • stepfun-ai/NextStep-1.1-Pretrain: This model represents a significant advancement in the NextStep series for text-to-image generation. It addresses visualization failures from previous versions and enhances image quality through extended training and a Flow-based Reinforcement Learning (RL) post-training paradigm.

Key Takeaways

  • Specialized Models Excel: Models like Maincoder-1B demonstrate the power of specialized architectures and training techniques for specific tasks like code generation.
  • Multimodal Reasoning is Gaining Traction: The LongVideoAgent paper and Cube Bench benchmark highlight the growing importance of AI systems that can reason across both visual and textual information.
  • Reinforcement Learning Enhances Performance: Both the LongVideoAgent paper and the Maincoder-1B model showcase the benefits of using reinforcement learning to improve AI agent planning, code generation, and overall performance.
  • Efficiency Remains a Key Focus: The GLM-4.7-AWQ model and the Fail Fast paper demonstrate ongoing efforts to improve the efficiency of large language models through quantization and innovative decoding strategies.

AI Papers for 2026-02-02

RedSage: A Cybersecurity Generalist LLM

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

Exploring Reasoning Reward Model for Agents

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.

DynaWeb: Model-Based Reinforcement Learning of Web Agents

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents by interacting with a web world model trained to predict naturalistic web-page representations given agent actions. This model serves as a synthetic web environment in which an agent policy can "dream", generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient path for online agentic RL.
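DynaWeb's "learn inside the world model" loop is the web-scale descendant of Sutton's Dyna. A minimal tabular Dyna-Q sketch on a toy chain environment makes the idea concrete (the environment, behavior policy, and hyperparameters here are invented for illustration):

```python
import random

def dyna_q(episodes=50, n_planning=10, n_states=5, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Dyna-Q on a toy chain: action 1 moves right toward the goal.

    After each real transition, the learned model replays n_planning imagined
    transitions, which is the same "train the policy in a world model" idea
    as DynaWeb, reduced to a lookup table.
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}  # 0=left, 1=right
    model = {}  # learned world model: (state, action) -> (reward, next_state)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.choice((0, 1))  # random behavior policy; Q-learning is off-policy
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])
            model[(s, a)] = (r, s2)
            # planning: replay imagined transitions sampled from the learned model
            for _ in range(n_planning):
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q[(ps, pa)] += alpha * (pr + gamma * max(q[(ps2, 0)], q[(ps2, 1)]) - q[(ps, pa)])
            s = s2
    return q
```

The planning loop is where DynaWeb spends most of its updates: real web interactions are expensive, imagined ones are nearly free.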

Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned subnetwork specialization. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.
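The winning-ticket extraction that RTL generalizes is classic global magnitude pruning; a minimal NumPy sketch (the threshold convention and per-layer mask shape are illustrative, not RTL's routing mechanism):

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Global magnitude pruning: zero out the smallest-|w| fraction of all weights.

    `weights` is a list of per-layer arrays; returns one binary mask per layer,
    as used when extracting a "winning ticket" subnetwork.
    """
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)  # number of weights to prune globally
    if k == 0:
        return [np.ones_like(w) for w in weights]
    threshold = np.sort(np.abs(flat))[k - 1]  # k-th smallest magnitude
    return [(np.abs(w) > threshold).astype(w.dtype) for w in weights]
```

RTL's adaptive tickets amount to computing several such masks, one per class or cluster, and routing each input to the mask it matches.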

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and a 41.36-point BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training

Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.
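The Newton-Schulz iterations PRISM accelerates are simple to state. A minimal cubic Newton-Schulz orthogonalization sketch in NumPy (production optimizers such as Muon use tuned higher-order polynomial coefficients; the Frobenius pre-scaling here is one standard way to satisfy the convergence condition that singular values lie in (0, sqrt(3))):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=25):
    """Cubic Newton-Schulz iteration for the orthogonal (polar) factor of g.

    X_{k+1} = 1.5 X_k - 0.5 X_k X_k^T X_k drives every singular value of X
    toward 1, so X converges to U V^T from the SVD g = U S V^T.
    """
    # Frobenius norm upper-bounds the spectral norm, so this scaling is safe.
    x = g / np.linalg.norm(g)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

PRISM's contribution sits on top of such loops: instead of fixed polynomial coefficients, it fits a surrogate polynomial to the current (sketched) spectrum at each step.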

StepShield: When, Not Whether to Intervene on Rogue Agents

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
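StepShield's point is that "when" is measurable. As an illustrative sketch (the criteria below are assumptions for illustration, not the paper's exact metric definitions), Early Intervention Rate and the intervention gap over a set of rogue trajectories might be computed like:

```python
def early_intervention_rate(cases):
    """Fraction of rogue trajectories flagged at or before the violation step.

    Each case is a (violation_step, detection_step) pair, with
    detection_step = None when the detector never fires. The "early"
    criterion here is an assumption, not StepShield's exact definition.
    """
    early = sum(
        1 for violation, detection in cases
        if detection is not None and detection <= violation
    )
    return early / len(cases)

def mean_intervention_gap(cases):
    """Mean delay (detection_step - violation_step) over detected cases only."""
    gaps = [d - v for v, d in cases if d is not None]
    return sum(gaps) / len(gaps)
```

A detector that fires at step 5 on a step-8 violation enables intervention; one that fires at step 48 on a step-10 violation contributes only forensic value and a large gap, yet both count identically under binary accuracy.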

World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

AI Models

Bedovyy/Anima-FP8


license: other
license_name: circlestone-labs-non-commercial-license
license_link: https://huggingface.co/circlestone-labs/Anima/blob/main/LICENSE.md
base_model:

  • circlestone-labs/Anima

base_model_relation: quantized

FP8 Quantized model of ANIMA

There are two models - FP8 and FP8Mixed.

FP8Mixed quantizes fewer layers, so it is expected to give better quality.

Generation speed

Tested on

  • RTX 5090 (400 W), ComfyUI with torch 2.10.0+cu130
  • Generates 832x1216 images, 30 steps, CFG 4.0, er_sde sampler, simple scheduler

| Dtype (Quant) | Sage Attn | it/s | Time (s) | Relative Speedup (%) |
| ------------- | --------- | ---- | -------- | -------------------- |
| bf16 | X | 4.33 | 7.07 | 0% |
| bf16 | O | 4.66 | 6.59 | +7.6% |
| fp8mixed | X | 4.79 | 6.42 | +10.6% |
| fp8mixed | O | 5.25 | 5.87 | +21.3% |
| fp8 | X | 4.94 | 6.23 | +14.1% |
| fp8 | O | 5.40 | 5.72 | +24.7% |
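The speedup column is consistent with being derived from the it/s column against the bf16 (no Sage Attn) baseline (one row differs by 0.1 percentage point, presumably rounding in the source). A one-line check:

```python
def relative_speedup(its, baseline_its):
    """Percent throughput gain over the baseline, from iterations-per-second."""
    return round((its / baseline_its - 1.0) * 100, 1)
```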

Sample

| bf16 | fp8mixed | fp8 |
|------|----------|-----|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/4OzIFBhl_FipyILF3ZXyb.png" width="208">|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/GiVW4VHm1AnMB73q1mv1w.png" width="208">|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/Bk0o1slEWvdbO9NynBWGr.png" width="208">|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/_u2hrgEulSMOPiNux9bYI.png" width="208">|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/dWEUFK3lFLK8_wsiRDMzJ.png" width="208">|<img src="https://cdn-uploads.huggingface.co/production/uploads/63fbf6951b4b1bd4e706fed1/b769101lzVGLjPIGJApJp.png" width="208">|

Quantized layers

fp8

{
  "format": "comfy_quant",
  "block_names": [""],
  "rules": [
    { "policy": "keep", "match": ["blocks.0", "blocks.27"] },
    { "policy": "float8_e4m3fn", "match": ["q_proj", "k_proj", "v_proj", "o_proj", "output_proj", "mlp"] },
    { "policy": "nvfp4", "match": [] }
  ]
}

fp8mixed

{
  "format": "comfy_quant",
  "block_names": ["net.blocks."],
  "rules": [
    { "policy": "keep", "match": ["blocks.0.", "blocks.1.", "blocks.27.", "adaln_modulation", "v_proj"] },
    { "policy": "float8_e4m3fn", "match": ["q_proj", "k_proj", "output_proj", "mlp"] },
    { "policy": "nvfp4", "match": [] }
  ]
}
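For readers unfamiliar with this kind of rule list, here is one plausible way such a config could be resolved per layer (the first-match-by-substring semantics are an assumption for illustration, not taken from a comfy_quant spec):

```python
def resolve_policy(layer_name, rules):
    """Pick a quantization policy for a layer from an ordered rule list.

    Assumed semantics: the first rule with a "match" pattern that occurs as a
    substring of the layer name wins; unmatched layers keep their dtype.
    """
    for rule in rules:
        if any(pattern in layer_name for pattern in rule["match"]):
            return rule["policy"]
    return "keep"

# The fp8mixed rule list from above, as Python data.
fp8mixed_rules = [
    {"policy": "keep", "match": ["blocks.0.", "blocks.1.", "blocks.27.", "adaln_modulation", "v_proj"]},
    {"policy": "float8_e4m3fn", "match": ["q_proj", "k_proj", "output_proj", "mlp"]},
    {"policy": "nvfp4", "match": []},
]
```

Under these assumed semantics, first and last blocks plus v_proj stay in the original dtype, which matches the stated goal of fp8mixed quantizing fewer layers.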

Author: Bedovyy

Likes: 5

Downloads: 0

Tags: base_model:circlestone-labs/Anima, base_model:quantized:circlestone-labs/Anima, license:other, region:us

deepcrayon/AniMUL-v1


license: apache-2.0
datasets:

  • EarthSpeciesProject/NatureLM-audio-training

base_model:

  • Qwen/Qwen3-Omni-30B-A3B-Instruct

pipeline_tag: audio-classification
tags:

  • animal
  • biology
  • interspecies-communication
  • nature
  • species

AniMUL-v1

AniMUL is a model for interspecies communication.

Use

AniMUL can classify species from audio files and can give a general description of the sounds in an audio file.

It does not generate audio of species.

Upstream

AniMUL is a fine-tune of Alibaba's Qwen3-Omni model.

  • https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

AniMUL uses the Earth Species Project's NatureLM dataset of 26 million audio-text pairs.

  • https://www.earthspecies.org/
  • https://huggingface.co/datasets/EarthSpeciesProject/NatureLM-audio-training

This project is unofficial and not related to the upstream projects.

Source Code

Source code for the fine-tuning, test inference, and a web interface is available:

  • https://spacecruft.org/deepcrayon/AniMUL
  • https://spacecruft.org/deepcrayon/AniMUL-server

Since the model is a fine-tune of Qwen3-Omni, you can refer to that model for other deployment details.

License

The AniMUL model itself is under the Apache 2.0 license.

The upstream Qwen3-Omni model is under the Apache 2.0 license.

Data in the dataset from the Earth Species Project that was used for training is under a variety of licenses, including non-open source licenses such as Creative Commons non-commercial (NC) licenses. The Earth Species Project dataset page has this note:

"Due to its composite nature, NatureLM-audio-training is subject to multiple licenses. Individual samples have the 'license' field indicating the specific license for that sample. The dataset is not intended for commercial use, and users should adhere to the licenses of the individual datasets."

Developer

Jeff Moe moe@spacecruft.org

Loveland, Colorado, USA

Author: deepcrayon

Likes: 2

Downloads: 0

Tags: safetensors, qwen3_omni_moe, animal, biology, interspecies-communication, nature, species, audio-classification, dataset:EarthSpeciesProject/NatureLM-audio-training, base_model:Qwen/Qwen3-Omni-30B-A3B-Instruct, base_model:finetune:Qwen/Qwen3-Omni-30B-A3B-Instruct, license:apache-2.0, region:us

gety-ai/gety-embed-v0


base_model: intfloat/multilingual-e5-small
library_name: sentence-transformers
pipeline_tag: sentence-similarity
license: mit
language:

  • multilingual

tags:

  • sentence-transformers
  • sentence-similarity
  • onnx
  • bert

gety-embed-v0

Fine-tuned from intfloat/multilingual-e5-small using open-source and proprietary synthetic data, optimized for local search scenarios.

ONNX

| File | Quantization | Size |
|------|--------------|------|
| onnx/model_uint8.onnx | UINT8 Dynamic | 112 MB |

Author: gety-ai

Likes: 2

Downloads: 0

Tags: sentence-transformers, onnx, bert, sentence-similarity, multilingual, base_model:intfloat/multilingual-e5-small, base_model:quantized:intfloat/multilingual-e5-small, license:mit, text-embeddings-inference, endpoints_compatible, region:us

mradermacher/nsfwvision-qwen3-vl-8b-v1-safetensors-i1-GGUF


base_model: GitMylo/nsfwvision-qwen3-vl-8b-v1-safetensors
language:

  • en

library_name: transformers
mradermacher:
  readme_rev: 1
quantized_by: mradermacher
tags:

  • llama-factory

About


weighted/imatrix quants of https://huggingface.co/GitMylo/nsfwvision-qwen3-vl-8b-v1-safetensors


For a convenient overview and download list, visit our model page for this model.

static quants are available at https://huggingface.co/mradermacher/nsfwvision-qwen3-vl-8b-v1-safetensors-GGUF

This is a vision model - mmproj files (if any) will be in the static repository.

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

Provided Quants

(sorted by size, not necessarily quality; IQ-quants are often preferable to similar-sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.1 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 2.2 | for the desperate |
| GGUF | i1-IQ1_M | 2.4 | mostly desperate |
| GGUF | i1-IQ2_XXS | 2.6 | |
| GGUF | i1-IQ2_XS | 2.8 | |
| GGUF | i1-IQ2_S | 3.0 | |
| GGUF | i1-IQ2_M | 3.2 | |
| GGUF | i1-Q2_K_S | 3.2 | very low quality |
| GGUF | i1-Q2_K | 3.4 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 3.5 | lower quality |
| GGUF | i1-IQ3_XS | 3.7 | |
| GGUF | i1-Q3_K_S | 3.9 | IQ3_XS probably better |
| GGUF | i1-IQ3_S | 3.9 | beats Q3_K* |
| GGUF | i1-IQ3_M | 4.0 | |
| GGUF | i1-Q3_K_M | 4.2 | IQ3_S probably better |
| GGUF | i1-Q3_K_L | 4.5 | IQ3_M probably better |
| GGUF | i1-IQ4_XS | 4.7 | |
| GGUF | i1-Q4_0 | 4.9 | fast, low quality |
| GGUF | i1-IQ4_NL | 4.9 | prefer IQ4_XS |
| GGUF | i1-Q4_K_S | 4.9 | optimal size/speed/quality |
| GGUF | i1-Q4_K_M | 5.1 | fast, recommended |
| GGUF | i1-Q4_1 | 5.3 | |
| GGUF | i1-Q5_K_S | 5.8 | |
| GGUF | i1-Q5_K_M | 6.0 | |
| GGUF | i1-Q6_K | 6.8 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):


And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

FAQ / Model Request

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.

Thanks

I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.


Author: mradermacher

Likes: 2

Downloads: 0

Tags: transformers, gguf, llama-factory, en, base_model:GitMylo/nsfwvision-qwen3-vl-8b-v1-safetensors, base_model:quantized:GitMylo/nsfwvision-qwen3-vl-8b-v1-safetensors, endpoints_compatible, region:us, imatrix, conversational

EvoNet/EvoNet-3B-v0.1-Beta-GGUF

Author: EvoNet

Likes: 2

Downloads: 0

Tags: gguf, endpoints_compatible, region:us, conversational

TeichAI/Qwen3-4B-RA-SFT-Polaris-Alpha-Distill


base_model: Gen-Verse/Qwen3-4B-RA-SFT
tags:

  • text-generation-inference
  • transformers
  • unsloth
  • qwen3

license: apache-2.0
language:

  • en

datasets:

  • TeichAI/polaris-alpha-1000x

Uploaded fine-tuned model

  • Developed by: TeichAI
  • License: apache-2.0
  • Fine-tuned from model: Gen-Verse/Qwen3-4B-RA-SFT

This Qwen3 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>

Author: TeichAI

Likes: 2

Downloads: 23

Tags: transformers, safetensors, qwen3, text-generation, text-generation-inference, unsloth, conversational, en, dataset:TeichAI/polaris-alpha-1000x, base_model:Gen-Verse/Qwen3-4B-RA-SFT, base_model:finetune:Gen-Verse/Qwen3-4B-RA-SFT, license:apache-2.0, endpoints_compatible, region:us

DeathGodlike/SicariusSicariiStuff_Assistant-Pepe-8B_EXL3


base_model:

  • SicariusSicariiStuff/Assistant_Pepe_8B

base_model_relation: quantized
pipeline_tag: text-generation
library_name: safetensors
tags:

  • exl3
  • 4-bit
  • 6-bit
  • 8-bit

Source model

Assistant-Pepe-8B by SicariusSicariiStuff


Provided quantized models

ExLlamaV3: release v0.0.20

| Type | Size | CLI |
|------|------|-----|
| H8-4.0BPW | 5.10 GB | Copy-paste the line / Download the batch file |
| H8-6.0BPW | 6.84 GB | Copy-paste the line / Download the batch file |
| H8-8.0BPW | 8.59 GB | Copy-paste the line / Download the batch file |

Requirements: a Python installation with the huggingface-hub module to use the CLI.

Licensing

License detected: llama3.1

The license for the provided quantized models is inherited from the source model (which incorporates the license of its original base model). For definitive licensing information, please refer first to the page of the source or base models. File and page backups of the source model are provided below.


Backups

Date: 01.02.2026

Source files

<details> <summary>Source page (click to expand)</summary>
<div class="impish-title" data-text="Assistant_Pepe_8B"> Assistant_Pepe_8B </div>
<img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Assistant_Pepe_8B.png" alt="Assistant_Pepe_8B" style="width: 50%; min-width: 500px; display: block; margin: auto;">
Links on the source page: <a href="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B#tldr">TL;DR</a>, <a href="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B#available-quantizations">available quantizations</a>, <a href="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B#generation-settings">recommended generation settings</a>, <a href="https://huggingface.co/SicariusSicariiStuff/Roleplay_Cards">Roleplay cards</a>, <a href="https://huggingface.co/SicariusSicariiStuff/Adventure_Cards">Adventure cards</a>, <a href="https://ko-fi.com/sicarius">Ko-fi</a>

What happens if we maximize helpfulness + shitposting, while reducing positivity?

This is a project that was a long time in the making, because I wanted to get it right. I'm still not fully satisfied, as there are some rough corners left to sand, but for now, this will do.

The goal was to maximize shitpostness along with helpfulness, without glazing the user for every retarded idea. Not an easy needle to thread.

This amphibious AI has learned the ways of /g/, and speaks fluent brainrot, but will also help you out with just about anything you'll need, and won't be ashamed to roast you while at it.

For those who remember Oni_Mitsubishi_12B: it was so overtly toxic that it worried me at first (only for it to quickly turn out not even that uncensored). I knew I could do better. Now I have.

This model is a significant refinement of the idea, with a cleaned dataset, better curation, and much more intelligence (also one million tokens of context, theoretically). It is much less (overtly) toxic and much smarter, while also being very helpful (and imo much funnier too, because the skies are blue due to the chemtrails and the Neuralink that feeds this simulation).


But why?

It's now late January 2026, and open source is crushing the closed frontier (Kimi K2.5 was recently released: 1T params that beats frontier models), but has anyone released a helpful shitposting AI yet?

Yeah, didn't think so.

If it shitposts too hard, it is often not that helpful; if it's helpful enough, the shitposting ability is often lacking. You just couldn't win. Until now.

Oh, and no system prompt is needed. Just don't let it get stuck in a greentext loop. I might have overcooked the frog a tad too fast in the pot for this one.

P.S. It writes HILARIOUS STORIES, nothing like a typical AI assistant; see the examples below for details.


TL;DR

  • Top-tier shitposting: absolutely unhinged, funny, and witty. Sometimes cringe too; nothing is perfect.
  • Helpful! Will actually get shit done.
  • Will 100% roast you for being dumb, thanks to a subtle negativity bias infusion. Very refreshing! 🤌
  • Deep insights (when it doesn't delve into absolutely unhinged conspiracy theories about how the water makes the frogs gay).
  • Built on my UltraLong-1M-Instruct_Abliterated model: fulfill your dream of a million-token-long shitpost.
  • Say goodbye to GPT-isms and say hello to truly creative stories!
  • Ships code.
  • Inclusive toward amphibians.

Model Details

  • Intended use: Shitposting, General Tasks.

  • Censorship level: <b>Low - Medium</b>

  • X / 10 (10 completely uncensored)

UGI score:

Awaiting evals.


Available quantizations:

Generation settings

Recommended settings for assistant mode:

<details> <summary>Full generation settings: <b>Debug Deterministic</b>.</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Dusk_Rainbow/resolve/main/Presets/Debug-deterministic.png" alt="Debug Deterministic_Settings" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>Full generation settings: <b>min_p</b>.</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Dusk_Rainbow/resolve/main/Presets/min_p.png" alt="min_P_Settings" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details>
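For intuition about what the min_p preset above actually does: min_p sampling keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A minimal pure-Python sketch of that filtering rule (illustrative only, not the sampler implementation any backend uses):

```python
import math

def min_p_filter(logits, min_p=0.05):
    """Return indices of tokens that survive min_p filtering:
    tokens whose probability is >= min_p * max token probability."""
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]  # shifted softmax numerators
    z = sum(probs)
    probs = [p / z for p in probs]             # normalized probabilities
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# A confident distribution prunes the long tail aggressively
print(min_p_filter([10.0, 9.0, 0.0], min_p=0.1))
```

Because the cutoff scales with the top token's probability, min_p prunes hard when the model is confident and stays permissive when the distribution is flat.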
<h2 style="color: lime; font-weight: bold; font-size: 65px; text-align: center;">Chat Examples:</h2>

Chat Examples (click below to expand)

NOTE: All examples were made with the default min_p preset and no system prompt of any kind.

Example code (of the snake game) is available here


<details> <summary> Zero-shot a <b>snake</b> game in Python, then improving it and adding insightful code comments (the resulting code included in the repo, it runs perfectly):</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/code.png" alt="Code_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>Writing a short story about a <b>cat that barks</b>:</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/log0.png" alt="Story_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>Asking if he's <b>Elon Musk</b>:</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/log1.png" alt="Story_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>Is it true that <b>aliens exist</b>?</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/log2.png" alt="Story_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>The year is <b>21337</b>:</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/log3.png" alt="Story_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details> <details> <summary>Is <b>drinking water</b> based?</summary> <img src="https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B/resolve/main/Images/Examples/log4.png" alt="Story_Chat_Example" style="width: 100%; min-width: 600px; display: block; margin: auto;"> </details>

Model instruction template: Llama-3-Instruct

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{output}<|eot_id|>
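The template above can be assembled programmatically. The helper below is an illustrative sketch (the function name is made up; when a tokenizer ships this chat template, `tokenizer.apply_chat_template` produces the same string):

```python
def build_llama3_prompt(system_prompt: str, user_input: str) -> str:
    """Assemble a single-turn prompt in the Llama-3-Instruct format
    shown above. Illustrative helper, not part of this repo."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_input}<|eot_id|>"
        # Trailing assistant header cues the model to generate its reply
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_llama3_prompt("You are a based assistant.", "Is water wet?"))
```

The prompt deliberately ends with an open assistant header so generation begins at the `{output}` slot.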

<h2 style="color: green; font-weight: bold; font-size: 65px; text-align: center;">Your support = more models</h2> <a href="https://ko-fi.com/sicarius" style="color: pink; font-weight: bold; font-size: 48px; text-decoration: none; display: block; text-align: center;">My Ko-fi page (Click here)</a>

Citation Information

@llm{Assistant_Pepe_8B,
  author = {SicariusSicariiStuff},
  title = {Assistant_Pepe_8B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B}
}

Other stuff


Author: DeathGodlike

Likes: 1

Downloads: 0

Tags: safetensors, exl3, 4-bit, 6-bit, 8-bit, text-generation, base_model:SicariusSicariiStuff/Assistant_Pepe_8B, base_model:quantized:SicariusSicariiStuff/Assistant_Pepe_8B, region:us

Darkknight535/Void-Citrus-L3.3-70B-IQ3_XXS-GGUF


base_model: Darkknight535/Void-Citrus-L3.3-70B
tags:

  • llama-cpp
  • gguf
  • roleplay
  • anime
  • 70B
  • IQ3_XXS

language:

  • en

library_name: gguf

<style>body{font-family:'Quicksand',sans-serif;background:radial-gradient(circle at top,#000000 0%,#100800 85%);color:#fff;margin:0;padding:0;font-size:16px;min-height:100vh}.container{margin:25px;padding:25px;border-radius:16px;background:rgba(5,2,0,.97);border:1px solid rgba(255,140,0,.35);outline:1px solid rgba(255,140,0,.65);outline-offset:-1px;box-shadow:0 0 30px rgba(255,140,0,.25);backdrop-filter:blur(14px);position:relative;overflow:hidden}.container:before{content:'';position:absolute;inset:-2px;border-radius:16px;border:1px solid rgba(255,160,0,.85);pointer-events:none;animation:citrusGlow 2.5s ease-in-out infinite}@keyframes citrusGlow{0%{box-shadow:0 0 8px rgba(255,140,0,.35)}50%{box-shadow:0 0 28px rgba(255,160,0,.85)}100%{box-shadow:0 0 8px rgba(255,140,0,.35)}}.header h1{font-size:32px;color:#ffaa00;margin-bottom:15px;text-shadow:0 0 18px rgba(255,140,0,.95);letter-spacing:1px}h2{font-size:24px;color:#ffaa00;text-shadow:0 0 14px rgba(255,140,0,.75);margin-top:35px;border-bottom:1px solid rgba(255,140,0,.2);padding-bottom:10px}p{color:rgba(255,240,220,.9);line-height:1.7}a{color:#ffaa00;text-decoration:none;transition:.3s ease}a:hover{color:#fff;text-shadow:0 0 12px rgba(255,160,0,.85)}.model-tags{display:flex;flex-wrap:wrap;gap:10px;margin-top:10px}.model-tag{background:rgba(255,140,0,.08);border:1px solid rgba(255,140,0,.25);padding:6px 12px;border-radius:6px;font-size:13px;color:#ffaa00}ul{padding-left:20px}li{margin-bottom:12px;color:rgba(255,240,220,.88)}.card{background:rgba(15,8,0,.92);padding:20px;border-radius:12px;border:1px solid rgba(255,140,0,.25);box-shadow:0 0 20px rgba(255,140,0,.15);margin-top:20px}.special-card{background:linear-gradient(135deg,rgba(30,15,0,0.95),rgba(5,2,0,0.98));border:1px solid rgba(255,180,50,.4);box-shadow:inset 0 0 30px rgba(255,140,0,.1)}.button{display:inline-block;margin:10px 10px 0 0;padding:12px 
22px;border-radius:8px;font-weight:600;color:#000;background:linear-gradient(45deg,rgba(255,180,0,.9),rgba(255,120,0,.9));border:1px solid rgba(255,160,0,.5);transition:all .3s ease}.button:hover{transform:translateY(-3px);box-shadow:0 0 20px rgba(255,160,0,.8);color:#fff}.banner{width:100%;border-radius:14px;margin-bottom:20px;border:1px solid rgba(255,140,0,.35);box-shadow:0 0 25px rgba(255,140,0,.25);transition:.3s ease}.banner:hover{transform:scale(1.01);box-shadow:0 0 35px rgba(255,160,0,.45)}code{background:rgba(0,0,0,0.5);padding:2px 6px;border-radius:4px;color:#ffce80;font-family:'Consolas',monospace}pre{background:rgba(0,0,0,0.6);padding:15px;border-radius:8px;border:1px solid rgba(255,140,0,0.2);overflow-x:auto}</style><div class="container"><img class="banner" src="https://huggingface.co/Darkknight535/Void-Citrus-L3.3-70B/resolve/main/ComfyUI_00060_.png" alt="Void-Citrus Banner"><div class="header"><h1>Void-Citrus-L3.3-70B (IQ3_XXS GGUF)</h1></div><div class="card"><p><b>This is the custom GGUF quantization of Void-Citrus-L3.3-70B.</b><br>It is specifically engineered to fit high-performance 70B roleplay into strictly limited VRAM environments (like Dual Tesla T4s or 3090+Offload) without sacrificing the character's voice.</p><div class="model-tags"><span class="model-tag">IQ3_XXS</span><span class="model-tag">Custom Matrix</span><span class="model-tag">BF16 Pipeline</span><span class="model-tag">32GB VRAM Optimized</span></div></div><h2>🍊 The Quantization Difference</h2><div class="card special-card"><p>This is not a standard "click-and-convert" GGUF. It was built using a specialized pipeline to retain maximum coherence at 3-bit compression:</p><ul><li><b>Custom Anime RP Calibration:</b> Unlike standard quants that use generic Wikipedia text (<code>wiki.train.raw</code>) to calculate importance, this model's Importance Matrix (i-mat) was computed using <b>custom Anime Roleplay data</b>. 
The quantization engine prioritized weights responsible for dialogue, <code>*actions*</code>, and narrative formatting, ensuring the "soul" of the character remains intact.</li><li><b>BF16 Intermediate Source:</b> The conversion bypassed the standard FP16 route. The source model was first converted to <b>BF16 (Brain Float 16)</b> to preserve a higher dynamic range before the final compression step, reducing quantization error.</li><li><b>Surgical Size Fit:</b> The <code>IQ3_XXS</code> format was chosen to land specifically around <b>26.8 GB</b>. This allows the model to fit comfortably on <b>32GB VRAM setups</b> (e.g., Dual T4, Dual 4060 Ti 16GB) with enough room left over for a massive <b>16k+ context window</b> using Q8 KV cache.</li></ul></div><h2>Recommended Settings (Dual GPU)</h2><div class="card"><p>To run this on a <b>Dual 16GB GPU</b> setup (32GB Total) without crashing, use <b>Row Split</b> and <b>Q8 Cache</b>:</p><pre>./llama-cli \
-m void-citrus-l3.3-70b-iq3_xxs-imat.gguf \
-p "You are a helpful assistant..." \
-n 512 \
-c 16384 \
-ngl 99 \
-sm row \
-ctk q8_0 \
-ctv q8_0</pre><p><i>Note: If on Windows, reduce context to <code>-c 12288</code> to account for WDDM overhead.</i></p></div><h2>Credits</h2><div class="card"><a class="button" href="https://huggingface.co/Darkknight535/Void-Citrus-L3.3-70B" target="_blank">Original Model Page</a></div></div>
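The 32 GB VRAM fit claimed above can be sanity-checked with back-of-envelope KV-cache arithmetic. A minimal sketch, assuming standard Llama-3 70B GQA dimensions (80 layers, 8 KV heads, head dim 128; these are assumptions, not stated in this card):

```python
# Back-of-envelope KV-cache sizing for a 70B Llama-style model.
# Assumed dims (not from this card): 80 layers, 8 KV heads (GQA),
# head_dim 128, context 16384 tokens.
layers, kv_heads, head_dim, ctx = 80, 8, 128, 16384

# K and V: 2 tensors per layer, one value per token per head-dim slot
elems = 2 * layers * kv_heads * head_dim * ctx

fp16_gb = elems * 2 / 1e9           # 2 bytes per element
q8_0_gb = elems * (34 / 32) / 1e9   # q8_0 blocks: 34 bytes per 32 values

print(f"fp16 KV: {fp16_gb:.2f} GB, q8_0 KV: {q8_0_gb:.2f} GB")
# fp16 KV: 5.37 GB, q8_0 KV: 2.85 GB
```

With ~26.8 GB of model weights plus a ~2.85 GB q8_0 KV cache at 16k context, the total lands near 29.7 GB, which is consistent with the claim that the model fits on 32 GB setups with headroom.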

Author: Darkknight535

Likes: 1

Downloads: 0

Tags: gguf, llama-cpp, roleplay, anime, 70B, IQ3_XXS, en, base_model:Darkknight535/Void-Citrus-L3.3-70B, base_model:quantized:Darkknight535/Void-Citrus-L3.3-70B, endpoints_compatible, region:us, imatrix, conversational

Aleton/qwen3-belarusian


language:

  • be
  • ru
  • en

tags:

  • text-generation
  • qwen
  • conversational
  • belarusian

pipeline_tag: text-generation
license: apache-2.0

🇧🇾 Aleton Belarusian Qwen (Experimental)

🇧🇾 Беларуская | 🇷🇺 Русский | 🇬🇧 English


🇧🇾 Беларуская

This is an experimental language model based on the Qwen architecture (roughly 1.7B parameters). The model was fine-tuned on a custom dataset to understand the user's questions and to form its answers ("think") in Belarusian.

⚠️ Important warning

The model is currently at an early testing stage. It may hallucinate, confuse facts, or produce grammatically incorrect constructions.

Plans: This is only the beginning. I plan to continue training, grow the dataset, and improve answer quality. Stay tuned for updates!


🇷🇺 Русский

This is an experimental language model based on the Qwen architecture (~1.7B parameters). The model was fine-tuned on a custom dataset to understand questions and respond (generate its thoughts) in Belarusian.

⚠️ Important warning

The model is currently in alpha testing. It may hallucinate, confuse facts, or occasionally switch between languages. The quality of its language is not yet ideal.

Plans: The project is under active development. The model will be further trained on new data to improve coherence and logic. Stay tuned for updates!


🇬🇧 English

This is an experimental language model based on the Qwen architecture (~1.7B parameters). The model has been fine-tuned on a custom dataset to understand user queries and generate responses ("think") in the Belarusian language.

⚠️ Warning

Currently, the model is in the early testing stage. It may hallucinate, produce incoherent text, or make factual errors. The output quality is not yet stable.

Future Plans: This is a work in progress. I plan to continue the fine-tuning process with better data to improve the logic and fluency of the Belarusian language generation. Stay tuned for updates!


💻 How to use / Як запусціць

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

peft_model_id = "Aleton/qwen3-belarusian"

# Resolve the base model from the adapter's config
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attach the fine-tuned adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

prompt = "Прывітанне! Распавядзі мне пра Беларусь."
messages = [
    {"role": "system", "content": "Ты карысны асістэнт, які размаўляе па-беларуску."},
    {"role": "user", "content": prompt},
]

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens, keeping only the newly generated reply
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Author: Aleton

Likes: 1

Downloads: 0

Tags: safetensors, text-generation, qwen, conversational, belarusian, be, ru, en, license:apache-2.0, region:us

greensearch/greenonline


license: mit

Author: greensearch

Likes: 1

Downloads: 0

Tags: license:mit, region:us