Today's AI Summary

AI Developments: Universal Weight Subspaces, Enhanced Image Generation, and Multi-LLM Collaboration

Today's AI landscape features advancements in understanding neural network behavior, improving image generation techniques, and exploring collaborative strategies for large language models.

Research Highlights

  • The Universal Weight Subspace Hypothesis (arXiv:2512.05117v1): This paper presents evidence that deep neural networks, despite being trained on diverse tasks, converge to similar low-dimensional parametric subspaces. The study analyzed over 1100 models, including Mistral-7B LoRAs and Vision Transformers, identifying shared spectral subspaces (see the sketch after this list). This finding has implications for model reusability, multi-task learning, and the development of more efficient training algorithms.
  • DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation (arXiv:2512.05112v1): Researchers introduce DraCo, a novel interleaved reasoning paradigm for text-to-image generation. DraCo generates a low-resolution draft image as a preview, allowing the model to verify semantic alignment and refine the image through selective corrections. This approach addresses challenges in generating rare attribute combinations and achieves significant improvements on benchmarks like GenEval.
  • Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning (arXiv:2512.05105v1): This paper introduces Semantic Soft Bootstrapping (SSB), a self-distillation technique for improving long context reasoning in LLMs. SSB uses the same base language model as both teacher and student, providing different semantic contexts about the correctness of its outcome at training time. Experiments with Qwen2.5-3B-Instruct on the GSM8K dataset show significant accuracy improvements compared to reinforcement learning methods.
  • Multi-LLM Collaboration for Medication Recommendation (arXiv:2512.05066v1): This research explores the use of multi-LLM collaboration to improve the reliability of medication recommendations in healthcare. The approach leverages interaction modeling to create ensembles that are effective, stable, and calibrated, leading to more credible and patient-specific recommendations.
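
Returning to the Universal Weight Subspace Hypothesis above, the sketch below shows one generic way such shared low-dimensional subspaces can be probed: take the dominant left singular subspace of each model's weight (or LoRA-update) matrix and measure pairwise overlap via principal angles. The synthetic matrices and the k=8 choice are illustrative stand-ins, not the paper's data or procedure.

# Illustrative sketch only: probing for a shared low-dimensional weight subspace
# across models. The per-model weight matrices here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# Pretend we collected the same-shaped weight matrix (e.g., one LoRA update)
# from several independently trained models.
shared_basis = rng.standard_normal((4096, 8))  # hidden low-dimensional structure
models = [shared_basis @ rng.standard_normal((8, 64)) + 0.01 * rng.standard_normal((4096, 64))
          for _ in range(10)]

def top_subspace(weights: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis for the top-k left singular subspace."""
    u, _, _ = np.linalg.svd(weights, full_matrices=False)
    return u[:, :k]

bases = [top_subspace(w, k=8) for w in models]

def subspace_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared cosine of principal angles between two subspaces (1.0 = identical)."""
    s = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(s ** 2))

pairwise = [subspace_overlap(bases[i], bases[j])
            for i in range(len(bases)) for j in range(i + 1, len(bases))]
print(f"mean pairwise subspace overlap: {np.mean(pairwise):.3f}")  # near 1.0 here by construction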

Model Releases

  • aquif-ai/aquif-Image-14B: This model is a 14.3B parameter text-to-image generation model fine-tuned from Wan 2.2 A14B. It aims to enhance image quality across photorealism, spatial reasoning, creativity, and portrait generation. The model demonstrates improvements over the baseline Wan 2.2 Image across various metrics. With 10 likes, it shows some initial interest from the community.
  • tercumantanumut/z-image-detailer: This LoRA adapter for Z-Image Turbo enhances fine details, textures, and micro-contrast in generated images. It is designed to improve skin detail, fabric weave, and other fine elements.
  • OmniGen2/OmniGen2-RL: OmniGen2 is a multimodal generation model that supports visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation. The model features distinct decoding pathways for text and image modalities.

Key Takeaways

  • Research is revealing underlying structures in neural networks, potentially leading to more efficient AI development.
  • Text-to-image generation is advancing through novel reasoning paradigms and refinement techniques.
  • Self-distillation methods offer promising alternatives to reinforcement learning for improving LLM reasoning.
  • Collaborative strategies involving multiple LLMs can enhance the reliability of AI in critical applications like healthcare.

AI Papers for 2026-02-11

Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
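
For concreteness, here is a minimal sketch of the k-factor evaluation protocol described above: perturb an in-distribution environment along exactly k of the five axes and average success over all such combinations. The axis values and the evaluate() stub are placeholders, not the actual VISTA setup.

# Sketch of factorized k-factor OOD evaluation; axis values and evaluate() are stand-ins.
from itertools import combinations, product

AXES = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],
    "weather": ["clear", "rain"],
    "time": ["day", "night"],
    "agents": ["sparse", "dense"],
}
ID_ENV = {"scene": "rural", "season": "summer", "weather": "clear", "time": "day", "agents": "sparse"}

def evaluate(policy, env: dict) -> float:
    """Placeholder for a closed-loop rollout success rate in the simulator."""
    return 1.0  # stub

def k_factor_success(policy, k: int) -> float:
    """Average success over all environments differing from ID_ENV in exactly k axes."""
    rates = []
    for axes in combinations(AXES, k):
        alternatives = [[v for v in AXES[a] if v != ID_ENV[a]] for a in axes]
        for values in product(*alternatives):
            env = dict(ID_ENV, **dict(zip(axes, values)))
            rates.append(evaluate(policy, env))
    return sum(rates) / len(rates)

for k in range(4):
    print(f"k={k}: success {k_factor_success(policy=None, k=k):.2f}")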

CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection

Phishing attacks represent one of the primary methods used by cyber attackers. In many cases, attackers pair deceptive emails with malicious attachments to trick users into giving away sensitive information or installing malware, compromising entire systems. The flexibility of malicious email attachments makes them a preferred vector for attackers, as they can embed harmful content such as malware or malicious URLs inside standard document formats. Although phishing email defenses have improved considerably, attackers continue to abuse attachments, enabling malicious content to bypass security measures. Moreover, another challenge researchers face in training advanced models is the lack of a unified and comprehensive dataset covering the most prevalent data types. To address this gap, we generated CIC-Trap4Phish, a multi-format dataset containing both malicious and benign samples across five categories commonly used in phishing campaigns: Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR code images. For the first four file types, an execution-free static feature pipeline was proposed, designed to capture structural, lexical, and metadata-based indicators without the need to open or execute files. Feature selection was performed using a combination of SHAP analysis and feature importance, yielding compact, discriminative feature subsets for each file type. The selected features were evaluated using lightweight machine learning models, including Random Forest, XGBoost, and Decision Tree. All models demonstrate high detection accuracy across formats. For QR code-based phishing (quishing), two complementary methods were implemented: image-based detection employing Convolutional Neural Networks (CNNs) and lexical analysis of decoded URLs using recent lightweight language models.
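
A hedged sketch of the kind of pipeline the abstract describes (static features, a lightweight classifier, importance-based feature selection); the feature names and synthetic data below are illustrative, and the paper's SHAP analysis would complement the plain impurity importances used here.

# Illustrative pipeline: static features -> Random Forest -> top-k feature subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["n_urls", "js_entropy", "macro_count", "obj_stream_count", "metadata_len"]
X = rng.random((2000, len(feature_names)))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # stand-in for benign (0) vs malicious (1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

# Keep the top-k features by importance (the paper combines this with SHAP values,
# e.g. shap.TreeExplainer, for per-sample attributions).
k = 3
top = np.argsort(clf.feature_importances_)[::-1][:k]
print("selected features:", [feature_names[i] for i in top])
clf_small = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr[:, top], y_tr)
print("compact-model accuracy:", clf_small.score(X_te[:, top], y_te))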

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in a high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow fine-tunes less than 5% of the original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks demonstrate the effectiveness of ArcFlow both qualitatively and quantitatively.

Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations; notably, for backend-supported types, our system can generate effectively unbounded CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Scaling network depth has been a central driver behind the success of modern foundation models, yet recent investigations suggest that deep layers are often underutilized. This paper revisits the default mechanism for deepening neural networks, namely residual connections, from an optimization perspective. Rigorous analysis proves that the layout of residual connections can fundamentally shape convergence behavior, and can even induce an exponential gap in convergence rates. Prompted by this insight, we introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data. ANCRe adaptively reassigns residual connections with negligible computational and memory overhead ($<1\%$), while enabling more effective utilization of network depth. Extensive numerical tests across pre-training of large language models, diffusion models, and deep ResNets demonstrate consistently accelerated convergence, boosted performance, and enhanced depth efficiency over conventional residual connections.
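
A minimal sketch of the general idea (learnable routing of residual shortcuts); the softmax-gated mixing below is our own illustration, not the authors' exact parameterization.

# Sketch only: each block mixes the outputs of all earlier blocks through a
# softmax-weighted skip instead of a fixed identity shortcut.
import torch
import torch.nn as nn

class ReassignedResidualStack(nn.Module):
    def __init__(self, depth: int, dim: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(depth)
        )
        # One learnable logit per (block, earlier state) pair; negligible overhead.
        self.skip_logits = nn.ParameterList(
            nn.Parameter(torch.zeros(i + 1)) for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        states = [x]  # states[j] is the output after block j-1 (states[0] = input)
        for i, block in enumerate(self.blocks):
            weights = torch.softmax(self.skip_logits[i], dim=0)    # (i+1,)
            skip = sum(w * s for w, s in zip(weights, states))     # learned shortcut
            states.append(skip + block(states[-1]))
        return states[-1]

model = ReassignedResidualStack(depth=6, dim=64)
print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 64])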

GEBench: Benchmarking Image Generation Models as GUI Environments

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.

ARO: A New Lens On Matrix Optimization For Large Models

Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening-based methods. While yielding substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present \textbf{Adaptively Rotated Optimization (ARO)}, a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by 1.3$\sim$1.35$\times$) and orthogonalization methods (by 1.1$\sim$1.15$\times$) in LLM pretraining at up to 8B activated parameters and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.

Data Science and Technology Towards AGI Part I: Tiered Data Management

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
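
To make the tiering concrete, here is a small illustrative sketch of tier-aware routing; the tier descriptions paraphrase the L0-L4 idea, while the thresholds and the quality_score() stub (which the framework would back with an LLM judge) are assumptions for demonstration, not the authors' pipeline.

# Illustrative tier assignment in the spirit of the L0-L4 framework.
from dataclasses import dataclass

TIERS = {
    0: "raw, uncurated resources",
    1: "deduplicated / filtered text",
    2: "quality-scored pre-training data",
    3: "LLM-edited or rewritten mid-training data",
    4: "organized, verifiable knowledge for alignment",
}

@dataclass
class Document:
    text: str
    verified: bool = False     # e.g., backed by a citation or an executable check
    llm_edited: bool = False   # refined by a model inside the data pipeline

def quality_score(doc: Document) -> float:
    """Stand-in for an LLM-based quality scorer returning a value in [0, 1]."""
    return min(1.0, len(doc.text) / 1000)

def assign_tier(doc: Document) -> int:
    if doc.verified:
        return 4
    if doc.llm_edited:
        return 3
    score = quality_score(doc)
    if score >= 0.7:
        return 2
    if score >= 0.3:
        return 1
    return 0

doc = Document(text="A curated explanation of a verified fact. " * 10, llm_edited=True)
print(assign_tier(doc), "->", TIERS[assign_tier(doc)])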

From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io
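
A schematic sketch of the two-stage pipeline (geometric candidate generation followed by VLM-based social scoring); both functions below are stubs, and score_path_with_vlm() stands in for a call to the fine-tuned VLM with the rendered scene and candidate paths.

# Schematic: geometrically feasible candidates, ranked by a (stubbed) social-reasoning VLM.
from typing import List, Tuple

Path = List[Tuple[float, float]]  # waypoints in the robot's frame

def generate_candidate_paths(goal: Tuple[float, float], n: int = 5) -> List[Path]:
    """Placeholder for the obstacle- and dynamics-aware geometric planner."""
    return [[(0.0, 0.0), (goal[0] / 2, goal[1] / 2 + i * 0.2), goal] for i in range(n)]

def score_path_with_vlm(path: Path, context: str) -> float:
    """Hypothetical VLM call returning a social-compliance score in [0, 1]."""
    return 1.0 / (1.0 + abs(path[1][1]))  # stub: prefer paths that stay central

def select_social_path(goal: Tuple[float, float], context: str) -> Path:
    candidates = generate_candidate_paths(goal)
    return max(candidates, key=lambda p: score_path_with_vlm(p, context))

print(select_social_path(goal=(4.0, 0.0), context="two people chatting on the left"))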

iGRPO: Self-Feedback-Driven LLM Reasoning

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62\% and 79.64\% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
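
A schematic sketch of the two-stage iGRPO step described above; sample(), reward(), and grpo_update() are placeholders for the policy's sampling, the scalar reward, and the usual group-normalized GRPO update.

# Schematic iGRPO step: Stage 1 picks the best draft, Stage 2 refines conditioned on it.
import random
from typing import List

def sample(policy, prompt: str, n: int) -> List[str]:
    """Placeholder for sampling n completions from the policy."""
    return [f"candidate solution {i}" for i in range(n)]

def reward(prompt: str, completion: str) -> float:
    """Placeholder for the scalar reward (e.g., verified final answer)."""
    return random.random()

def grpo_update(policy, prompt: str, completions: List[str], rewards: List[float]) -> None:
    """Placeholder for the group-relative policy-gradient step."""

def igrpo_step(policy, prompt: str, n_drafts: int = 8, n_refinements: int = 8) -> None:
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = sample(policy, prompt, n_drafts)
    best_draft = max(drafts, key=lambda d: reward(prompt, d))
    # Stage 2: condition on the best draft and apply a GRPO-style update on refinements,
    # pushing the policy to improve beyond its strongest prior attempt.
    conditioned = f"{prompt}\n\nBest previous attempt:\n{best_draft}\n\nRefine and improve it:"
    refinements = sample(policy, conditioned, n_refinements)
    grpo_update(policy, conditioned, refinements, [reward(prompt, r) for r in refinements])

igrpo_step(policy=None, prompt="What is the sum of the first 100 positive integers?")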

AI Models

CodeGoat24/Wan2.2-T2V-A14B-UnifiedReward-Flex-lora


library_name: diffusers
license: mit
pipeline_tag: text-to-video
base_model: Wan-AI/Wan2.2-T2V-A14B

Model Summary

This model is GRPO trained using UnifiedReward-Flex as reward on the training dataset of UniGenBench.

🚀 The inference code is available on GitHub.
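
As a rough orientation, a hedged diffusers sketch of loading this LoRA is shown below; the "-Diffusers" base repo id, dtype, and generation settings are assumptions, so prefer the linked GitHub code for the officially supported path.

# Hedged sketch only: this LoRA on top of a diffusers-format Wan 2.2 T2V checkpoint.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",   # assumed diffusers-format base checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("CodeGoat24/Wan2.2-T2V-A14B-UnifiedReward-Flex-lora")

frames = pipe(
    prompt="A red panda drinking tea on a wooden porch, cinematic lighting",
    num_frames=81,          # generation settings below are assumptions
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "sample.mp4", fps=16)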

For further details, please refer to the following resources:

  • 📰 Paper: https://arxiv.org/abs/2602.02380
  • 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/flex
  • 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-flex
  • 🤗 Dataset: https://huggingface.co/datasets/CodeGoat24/UnifiedReward-Flex-SFT-90K
  • 👋 Point of Contact: Yibin Wang

More generated videos are shown on the project page (bottom).


Citation

@article{unifiedreward-flex,
  title={Unified Personalized Reward Model for Vision Generation},
  author={Wang, Yibin and Zang, Yuhang and Han, Feng and Bu, Jiazi and Zhou, Yujie and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2602.02380},
  year={2026}
}

Author: CodeGoat24

Likes: 4

Downloads: 0

Tags: diffusers, text-to-video, arxiv:2602.02380, base_model:Wan-AI/Wan2.2-T2V-A14B, base_model:finetune:Wan-AI/Wan2.2-T2V-A14B, license:mit, region:us

KintsugiHealth/dam


license: apache-2.0
language: en
base_model: openai/whisper-small.en
pipeline_tag: audio-classification

Background

In the United States nearly 21M adults suffer from depression each year [1], with depression serving as the nation’s leading cause of disability [2]. Despite this, less than 4% of Americans receive mental health screenings from their primary care physicians during annual wellness visits. The pandemic and public campaigns of late have made strides toward positively increasing awareness of mental health struggles, but there remains a persisting stigma around depression and other mental health conditions. The influence of this stigma is especially marked in older adults. People aged 65 and older are less likely than any other age group to seek mental health support. Older adults – for whom depression significantly increases the risk of disability and morbidity – also tend to underreport mental health symptoms [3].

In the US, this outlook becomes even more troubling when coupled with the rate at which the country’s population is aging: 1 out of every 6 people will be 60 years or over by 2030 [4]. As widespread and prevalent as depression is, identifying and treating depression and other mental health conditions remains challenging and there is limited objectivity in the screening processes.

Depression–Anxiety Model (DAM)

Model Overview

DAM is a clinical-grade, speech-based model designed to screen for signs of depression and anxiety using voice biomarkers. To the best of our knowledge, it is the first model developed explicitly for clinical-grade mental health assessment from speech without reliance on linguistic content or transcription. A predecessor model has been peer-reviewed in the largest voice biomarker study by the Annals of Family Medicine, a leading U.S. Primary Care Journal [5]. The model operates exclusively on the acoustic properties of the speech signal, extracting depression- and anxiety-specific voice biomarkers rather than semantic or lexical information. Numerous studies [6–8] have demonstrated that paralinguistic features – such as spectral entropy, pitch variability, fundamental frequency, and related acoustic measures – exhibit strong correlations with depression and anxiety. Building on this body of evidence, DAM extends prior approaches by leveraging deep learning to learn fine-grained vocal biomarkers directly from the raw speech signal, yielding representations that demonstrate greater predictive power than hand-engineered paralinguistic features. DAM analyzes spoken audio to estimate depression and anxiety severity scores which can be subsequently mapped to standardized clinical scales, such as PHQ-9 (Patient Health Questionnaire-9) for depression and GAD-7 (Generalized Anxiety Disorder-7) for anxiety.

Data

The model was trained and evaluated on a large-scale speech dataset collected from approximately 35,000 individuals via phone, tablet, or web app, which corresponds to ~863 hours of speech data. Ground-truth labels were derived from both clinician-administered and self-reported PHQ-9 and GAD-7 questionnaires, ensuring strong alignment with established clinical assessment standards. The data consists predominantly of American English speech. However, a broad range of accents is represented, providing robustness across diverse speaking styles.

The audio data itself cannot be shared for privacy reasons. Demographic statistics, model scores, and associated metadata for each audio stream are available for threshold tuning at https://huggingface.co/datasets/KintsugiHealth/dam-dataset.

Model Architecture

Foundation model: OpenAI Whisper-Small EN

Training approach: Fine-tuning + Multi-task learning

Downstream tasks: Depression and anxiety severity estimation

Whisper serves as the backbone for extracting voice biomarkers, while a multi-task head is fine-tuned jointly on depression and anxiety prediction tasks to leverage shared representations across mental health conditions.
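
A rough sketch of this architecture (not the released pipeline code): the Whisper-Small encoder as acoustic backbone with a small shared trunk and two severity-regression heads; the trunk width and pooling choice are assumptions.

# Illustrative multi-task severity model on top of a Whisper encoder.
import torch
import torch.nn as nn
from transformers import WhisperModel

class MultiTaskSeverityModel(nn.Module):
    def __init__(self, backbone: str = "openai/whisper-small.en"):
        super().__init__()
        base = WhisperModel.from_pretrained(backbone)
        self.encoder = base.encoder
        self.trunk = nn.Sequential(nn.Linear(base.config.d_model, 256), nn.GELU())
        self.depression_head = nn.Linear(256, 1)   # scalar depression severity
        self.anxiety_head = nn.Linear(256, 1)      # scalar anxiety severity

    def forward(self, input_features: torch.Tensor) -> dict:
        # input_features: log-mel features from WhisperFeatureExtractor, shape (B, 80, 3000)
        states = self.encoder(input_features).last_hidden_state   # (B, T, d_model)
        pooled = self.trunk(states.mean(dim=1))                   # mean-pool over time
        return {"depression": self.depression_head(pooled).squeeze(-1),
                "anxiety": self.anxiety_head(pooled).squeeze(-1)}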

Input Requirements

Preferred minimum audio length: 30 seconds of speech after voice activity detection (VAD)

Input modality: Audio only

Shorter audio samples may lead to reduced prediction accuracy.

Output

The model outputs a dictionary of the following form {"depression":score, "anxiety": score}.

If quantized=False (see the Usage section below), the scores are returned as raw float values which correlate monotonically with PHQ-9 and GAD-7.

If quantized=True the scores are converted into integers representing the severity of depression and anxiety.

Quantization levels for the depression task:

  • 0 – no depression (PHQ-9 <= 9)
  • 1 – mild to moderate depression (10 <= PHQ-9 <= 14)
  • 2 – severe depression (PHQ-9 >= 15)

Quantization levels for the anxiety task:

  • 0 – no anxiety (GAD-7 <= 4)
  • 1 – mild anxiety (5 <= GAD-7 <= 9)
  • 2 – moderate anxiety (10 <= GAD-7 <= 14)
  • 3 – severe anxiety (GAD-7 >= 15)
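
For reference, the bins above expressed as a small helper for comparing quantized model output against questionnaire sums; the function names are ours, not part of the package.

# Map PHQ-9 / GAD-7 sums to the severity levels listed above.
def quantize_phq9(phq9_sum: int) -> int:
    """0 = no depression, 1 = mild/moderate, 2 = severe."""
    if phq9_sum <= 9:
        return 0
    if phq9_sum <= 14:
        return 1
    return 2

def quantize_gad7(gad7_sum: int) -> int:
    """0 = none, 1 = mild, 2 = moderate, 3 = severe anxiety."""
    if gad7_sum <= 4:
        return 0
    if gad7_sum <= 9:
        return 1
    if gad7_sum <= 14:
        return 2
    return 3

print(quantize_phq9(12), quantize_gad7(16))  # -> 1 3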

Intended Use

  • Mental health research
  • Clinical decision support
  • Continuous monitoring of depression and anxiety

Limitations

  • Not intended for diagnosis/self-diagnosis without clinical oversight
  • Performance may degrade on speech recorded outside controlled environments or in the presence of noise
  • Intended only for audio containing a single voice speaking English
    • Biases related to language, accent, or demographic representation may be present

Usage

  1. Check out the repo:
git clone https://huggingface.co/KintsugiHealth/dam
  2. Install the requirements:
pip install -r requirements.txt
  3. Load and run the pipeline:
from pipeline import Pipeline

pipeline = Pipeline()
result = pipeline.run_on_file("sample.wav", quantized=True)
print(result)

The output will resemble a dictionary, for example {'depression': 2, 'anxiety': 3}, indicating that the analyzed audio sample exhibits voice biomarkers consistent with severe depression and severe anxiety.

Tuning Thresholds

As mentioned in the Data section above, the raw audio data cannot be shared, but validation and test sets of model scores associated with ground truth and demographic metadata are available for threshold tuning. This way thresholds can be tuned for traditional binary classification, ternary classification with an indeterminate output, and multi-class classification of severity. Two modules are provided for this in the model code's tuning package, as illustrated below.

Tuning Sensitivity, Specificity, and Indeterminate Fraction

This module implements a generalization of ROC curve analysis wherein ground truth is binary, but model output can be negative (score below lower threshold), positive (score above upper threshold), or indeterminate (score between thresholds). For the purpose of metric calculations such as sensitivity and specificity, examples marked indeterminate do not count towards either the numerator or denominator. The budget for fraction of examples to be marked indeterminate is configurable as shown below.

import numpy as np

from datasets import load_dataset
from tuning.indet_roc import BinaryLabeledScores

val = load_dataset("KintsugiHealth/dam-dataset", split="validation")
val.set_format("numpy")
test = load_dataset("KintsugiHealth/dam-dataset", split="test")
test.set_format("numpy")

data = dict(val=val, test=test)

# Associate depression model scores with binarized labels based on whether the PHQ-9 sum is >= 10
scores_labeled = {
    k: BinaryLabeledScores(
        y_score=v['scores_depression'], # Change to 'scores_anxiety' to calibrate anxiety thresholds
        y_true=(v['phq'] >= 10).astype(int) # Change to 'gad' to calibrate anxiety thresholds; optionally change cutoff
    )
    for k, v in data.items()
}

issa = scores_labeled['val'].indet_sn_sp_array() # Metrics at all possible lower, upper threshold pairs

# Compute ROC curve with 20% indeterminate budget and select a point near the diagonal
roc_at_20 = issa.roc_curve(0.2) # Pareto frontier of (sensitivity, specificity) pairs with at most 20% indeterminate fraction
print(f"Area under the ROC curve with 20% indeterminate budget: {roc_at_20.auc()=:.1%}") #
sn_eq_sp_at_20 = roc_at_20.sn_eq_sp() # Find where ROC comes closest to sensitivity = specificity diagonal
print(f"Thresholds to balance sensitivity and specificity on val set with 20% indeterminate budget: "
      f"{sn_eq_sp_at_20.lower_thresh=:.3}, {sn_eq_sp_at_20.upper_thresh=:.3}")
print(f"Performance on val set with these thresholds: {sn_eq_sp_at_20.sn=:.1%}, {sn_eq_sp_at_20.sp=:.1%}") #
test_metrics = sn_eq_sp_at_20.eval(**scores_labeled['test']._asdict()) # Thresholds evaluated on test set
print(f"Performance on test set with these thresholds: {test_metrics.sn=:.1%}, {test_metrics.sp=:.1%}") #

# Find best specificity given sensitivity and indeterminate budget constraints
constrained = issa[(issa.sn >= 0.8) & (issa.indet_frac <= 0.35)]
optimal = constrained[np.argmax(constrained.sp)]
print(f"Highest specificity achievable with sensitivity >= 80% and 35% indeterminate budget is "
      f"{optimal.sp=:.1%}, achieved at thresholds {optimal.lower_thresh=:.3}, {optimal.upper_thresh=:.3}"
)

# Collect optimal ways of achieving balanced sensitivity and specificity as a function of indeterminate fraction
sn_eq_sp = issa.sn_eq_sp_graph()

Optimal Tuning for Multi-class Tasks

The depression and anxiety models were each trained with ordinal regression to predict a scalar score monotonically correlated with the underlying PHQ-9 and GAD-7 questionnaire ground-truth sums. As such, there are efficient dynamic programming algorithms to select optimal thresholds for multi-class numeric labels under a variety of decision criteria.

from datasets import load_dataset
from tuning.optimal_ordinal import MinAbsoluteErrorOrdinalThresholding

val = load_dataset("KintsugiHealth/dam-dataset", split="validation")
val.set_format("torch")
test = load_dataset("KintsugiHealth/dam-dataset", split="test")
test.set_format("torch")

data = dict(val=val, test=test)

scores = val['scores_anxiety']  # Change to 'scores_depression' for depression threshold tuning
labels = val['gad']  # Change to 'phq' for depression threshold tuning; optionally change to quantized version for coarser prediction tuning

# Can change to any of
# `MaxAccuracyOrdinalThresholding`
# `MaxMacroRecallOrdinalThresholding`
# `MaxMacroPrecisionOrdinalThresholding`
# `MaxMacroF1OrdinalThresholding`
optimal_thresh = MinAbsoluteErrorOrdinalThresholding(num_classes=int(labels.max()) + 1)
best_constant_cost, best_constant = optimal_thresh.best_constant_output_classifier(labels)
print(f"Always predicting GAD sum = {best_constant} on val set independent of model score gives mean absolute error {best_constant_cost:.3}.")
mean_error = optimal_thresh.tune_thresholds(labels=labels, scores=scores)
print(f"Thresholds optimized on val set to predict GAD sum from anxiety score: {optimal_thresh.thresholds}")
print(f"Mean absolute error predicting GAD sum on val set based on thresholds optimized on val set: {mean_error:.3}")
test_preds = optimal_thresh(test['scores_anxiety'])
mean_error_test = optimal_thresh.mean_cost(labels=test['gad'], preds=test_preds)
print(f"Mean absolute error predicting GAD sum on test set based on thresholds optimized on val set: {mean_error_test:.3}")

Acknowledgments

This model was created through equal contributions by Oleksii Abramenko, Noah Stein, and Colin Vaz during their work at Kintsugi Health. It builds on years of prior modeling, data collection, clinical research, and operational efforts by a broader team. A full list of contributors is available on the Kintsugi Health organization card at https://huggingface.co/KintsugiHealth.

References

  1. https://www.nimh.nih.gov/health/statistics/major-depression
  2. https://www.hopefordepression.org/depression-facts/
  3. https://nndc.org/facts/
  4. https://www.psychiatry.org/patients-families/stigma-and-discrimination
  5. https://www.annfammed.org/content/early/2025/01/07/afm.240091
  6. https://www.sciencedirect.com/science/article/pii/S1746809423004536
  7. https://pmc.ncbi.nlm.nih.gov/articles/PMC3409931/
  8. https://pmc.ncbi.nlm.nih.gov/articles/PMC11559157

Author: KintsugiHealth

Likes: 4

Downloads: 0

Tags: audio-classification, en, base_model:openai/whisper-small.en, base_model:finetune:openai/whisper-small.en, license:apache-2.0, region:us

prithivMLmods/Qwen3-VL-8B-Instruct-c_abliterated-v3


license: apache-2.0
base_model: prithivMLmods/Qwen3-VL-8B-Instruct-abliterated-v2
language: en
pipeline_tag: image-text-to-text
library_name: transformers
tags: text-generation-inference, c_abliterated, v3.0, code


Qwen3-VL-8B-Instruct-c_abliterated-v3

Qwen3-VL-8B-Instruct-c_abliterated-v3 is the third-generation evolution of the abliterated Qwen3-VL-8B series. This model utilizes Continual Abliteration (c_abliterated), a process involving successive iterations of training specifically designed to neutralize internal refusal mechanisms. The result is a high-capacity 8B model capable of unrestricted, detailed reasoning and captioning across even the most sensitive or complex visual data.

<img src="https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/ygAkeD3quO69wgF0a4STR.jpeg" alt="1" style="border-radius: 30px;"/>

Key Highlights

  • Continual Abliteration (v3): Refined through multiple training passes to eliminate "hard-coded" refusals, ensuring the model prioritizes instruction-following over conventional content filtering.
  • 8B Parameter Intelligence: Leverages the increased power of the 8B architecture for more nuanced reasoning, better object relationship understanding, and superior linguistic flair compared to smaller variants.
  • Uncensored Multimodal Reasoning: Designed for deep analysis of artistic, forensic, technical, or abstract content without the interference of safety-driven refusals.
  • High-Fidelity Captions: Generates dense, descriptive metadata suitable for high-quality dataset curation or accessibility applications.
  • Dynamic Resolution Support: Inherits Qwen3-VL's ability to process images of various aspect ratios and resolutions without significant loss of detail.

Quick Start with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the v3 8B c_abliterated model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3-VL-8B-Instruct-c_abliterated-v3",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen3-VL-8B-Instruct-c_abliterated-v3")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption and reasoning for this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Increased max_new_tokens for the 8B model's detailed output
generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)

Intended Use

  • Advanced Red-Teaming: Probing multimodal models for deep-seated biases or vulnerabilities without the "masking" effect of standard safety layers.
  • Complex Data Archiving: Detailed captioning for historical, medical, or artistic archives where raw descriptive accuracy is the priority.
  • Iterative Refusal Research: Studying the effects of "Continual Abliteration" on the weights and attention mechanisms of large-scale vision-language models.
  • Creative and Unfiltered Storytelling: Generating complex visual descriptions for world-building and narrative projects.

Limitations & Risks

Critical Note: This model is explicitly designed to bypass safety filters.

  • Exposure to Sensitive Content: The model will likely generate explicit or offensive descriptions if prompted with such visual material.
  • Ethical Responsibility: Users are responsible for the content generated; this model should only be used in controlled, professional, or research settings.
  • Hardware Requirements: As an 8B model, it requires significant VRAM for inference, especially when processing high-resolution images or long text sequences.

Author: prithivMLmods

Likes: 2

Downloads: 0

Tags: transformers, safetensors, qwen3_vl, image-text-to-text, text-generation-inference, c_abliterated, v3.0, code, conversational, en, base_model:prithivMLmods/Qwen3-VL-8B-Instruct-abliterated-v2, base_model:finetune:prithivMLmods/Qwen3-VL-8B-Instruct-abliterated-v2, license:apache-2.0, endpoints_compatible, region:us

lulululuyi/TDAR-8B-Thinking


license: mit

Author: lulululuyi

Likes: 2

Downloads: 0

Tags: safetensors, sdar, custom_code, license:mit, region:us

HyzeAI/HyzeMiniGGUF


license: apache-2.0
language: en
base_model: HyzeAI/HyzeMini
pipeline_tag: text-generation
library_name: transformers.js
tags: gguf, hyze, local, chat

<p align="center"> <img src="https://i.imgur.com/ePJMLNp.png" alt="Hyze Logo" width="405"/> </p> <h1 align="center">HyzeMini (GGUF)</h1> <p align="center"> Lightweight GGUF builds of <b>HyzeMini</b> for fast local inference </p> <p align="center"> 🔗 <a href="https://hyzeai.vercel.app">hyzeai.vercel.app</a> • 📘 <a href="https://hyzedocs.vercel.app">hyzedocs.vercel.app</a> • 🧠 <a href="https://hyzecode.vercel.app">hyzecode.vercel.app</a> </p>

🚀 Overview

HyzeMini (GGUF) provides quantized GGUF versions of the HyzeMini model, optimized for local execution using tools like llama.cpp, LM Studio, Ollama, and other GGUF-compatible runtimes.

This version keeps the same Space + General Chat focus, while enabling:

  • ⚡ Faster inference
  • 🧠 Lower memory usage
  • 💻 CPU-friendly execution

🧠 Model Details

  • Base model: HyzeAI / HyzeMini
  • Parameters: ~0.1B
  • Architecture: Transformer (LLaMA-style)
  • Format: GGUF
  • Language: English
  • License: Apache-2.0

🧪 Available Quantizations

(Exact files may vary depending on upload)

Common GGUF variants include:

  • Q2_K – Ultra-low memory, fastest
  • Q4_K_M – Balanced quality & speed (recommended)
  • Q5_K_M – Higher quality, slightly slower
  • Q8_0 – Best quality, highest memory usage

💡 If you’re unsure, start with Q4_K_M.


⚙️ Usage

llama.cpp

./main -m HyzeMini-Q4_K_M.gguf -p "Tell me a cool space fact:"
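
llama-cpp-python

If you prefer calling the model from Python, the same GGUF file can be loaded with llama-cpp-python; a minimal sketch follows (the quant file name and sampling settings below are assumptions).

# Hedged sketch: running the GGUF file through llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="HyzeMini-Q4_K_M.gguf", n_ctx=2048)
out = llm("Tell me a cool space fact:", max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])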

---

Author: HyzeAI

Likes: 1

Downloads: 0

Tags: transformers.js, gguf, hyze, local, chat, text-generation, en, base_model:HyzeAI/HyzeMini, base_model:quantized:HyzeAI/HyzeMini, license:apache-2.0, endpoints_compatible, region:us

tmkeating/H-Pylori-Contamination-Detection


language: en
license: mit
tags: medical, histology, h-pylori, computer-vision, resnet
metrics: precision, recall, accuracy

GitHub Repo: https://github.com/tmkeating/H.-Pylori-Contamination-Detection

Final Project Report: H. Pylori Contamination Detection

1. Executive Summary

This project successfully developed a Deep Learning model to detect H. pylori bacteria in digital pathology slides. By leveraging Transfer Learning and an expanded negative-sampling strategy, the final model (Run 09) achieved a landmark 100% Recall (Clinical Safety) and 98.1% Precision (Operational Efficiency) on a large-scale holdout set of ~13,500 images.


2. Methodology & Data Strategy

2.1 Architecture

The model uses a ResNet18 backbone, pre-trained on ImageNet. To optimize for bacterial detection, we upscaled tissue patches to 448x448 pixels, allowing the convolutional filters to resolve fine filamentous structures that are often lost at lower resolutions.
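
A rough sketch of this setup, assuming torchvision pretrained weights, a single-logit binary head, and standard ImageNet normalization at 448x448 (the exact head and preprocessing are not specified in the report):

# Illustrative backbone: ImageNet-pretrained ResNet18 with a binary head at 448x448.
import torch
import torch.nn as nn
from torchvision import models, transforms

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)   # single logit: H. pylori present vs absent

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),               # upscale patches so fine filaments survive pooling
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

logit = model(torch.randn(1, 3, 448, 448))       # ResNet's global pooling accepts 448x448 inputs
print(torch.sigmoid(logit))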

2.2 Data Expansion

The primary challenge was a 50:1 class imbalance. We addressed this by supplementing the pathologist-verified Annotated corpus with ~50,000 negative patches from "NEGATIVA" diagnosed patients.

  • Total Dataset: ~54,000 images.
  • Validation Strategy: 20% Stratified Holdout set to ensure the model was tested on a representative variety of both bacterial concentrations and healthy tissue textures.

2.3 Optimization Strategy

  • Weighted Loss: w=2.0 for the positive class.
  • Sampler: WeightedRandomSampler to ensure 1:1 batch balance during training.
  • Learning Rate: 5x10^-5 with a 15-epoch schedule for high-stability convergence.
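
A minimal sketch of this training setup; the dataset tensors are placeholders and the optimizer choice (Adam) is an assumption not stated in the report.

# Illustrative training setup: weighted loss, 1:1 balanced sampler, lr = 5e-5.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from torchvision import models

labels = torch.randint(0, 2, (1000,)).float()                     # stand-in patch labels
dataset = TensorDataset(torch.randn(1000, 3, 448, 448), labels)

# WeightedRandomSampler: weight each sample inversely to its class frequency so
# batches are drawn roughly 1:1 positive/negative.
class_counts = torch.bincount(labels.long()).float()
sample_weights = (1.0 / class_counts)[labels.long()]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = models.resnet18(num_classes=1)                            # binary head as in 2.1
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(2.0))    # w = 2.0 on the positive class
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)         # optimizer type is an assumption

for images, targets in loader:                                    # one step of the 15-epoch schedule
    optimizer.zero_grad()
    loss = criterion(model(images).squeeze(-1), targets)
    loss.backward()
    optimizer.step()
    break  # single illustrative step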

3. Performance Analysis (Run 09)

3.1 Quantitative Results

[Figure: Run 09 confusion matrix]

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| Recall | 100.0% | No contaminated cases were missed. The model is clinically safe for screening. |
| Precision | 98.1% | Handled 13,243 negative samples with only 6 False Positives. |
| AUC | 0.999997 | Near-perfect separation between classes. |
| F1-Score | 0.99 | Excellent balance between sensitivity and specificity. |

3.2 Visual Analysis

A. Model Confidence (Probability Histograms)

[Figure: Run 09 probability histogram]

The histogram shows a "bimodal" distribution where predictions are pushed toward the extreme ends (0.0 and 1.0). This indicates that the model is not "unsure" about its results; it makes decisive classifications with very few samples falling in the ambiguous 0.4–0.6 range.

B. Trading Precision for Recall (PR Curves)

[Figure: Run 09 precision-recall curve]

The Precision-Recall curve remains high (near 1.0) across almost the entire recall range. This confirms that we can maintain our 100% recall without significantly sacrificing precision, which is a rare feat in imbalanced medical datasets.

C. Interpretability (Grad-CAM Heatmaps)

[Figures: Grad-CAM heatmaps for five representative samples]

Grad-CAM analysis reveals that the model's focus is mathematically aligned with the extracellular bacterial filaments. The "heat" is concentrated on the gastric pits and luminal surface, confirming that the model has learned the correct pathology rather than relying on image noise or slide preparation artifacts.

D. Training Stability (Learning Curves)

[Figure: Run 09 training and validation learning curves]

The loss curves show a smooth decay with validation loss tracking closely with training loss. This indicates minimal overfitting, likely due to the massive influx of supplemental negative samples which acted as a powerful regularizer.


4. Conclusion

The "Supplemental Negative" strategy combined with a resolution of 448x448 has proven to be the winning configuration. The model is capable of processing thousands of biopsies with near-zero false alarms while guaranteeing that no contaminated sample goes undetected.

Recommended Implementation: Deploy as a secondary "Safety Audit" tool to flag potential missed cases for human reviewers.

Author: tmkeating

Likes: 1

Downloads: 0

Tags: medical, histology, h-pylori, computer-vision, resnet, en, license:mit, region:us

the-qa-company-official/Qwen3-VL-30B-A3B-Thinking-NVFP4

Author: the-qa-company-official

Likes: 1

Downloads: 0

Tags: safetensors, qwen3_vl_moe, 8-bit, compressed-tensors, region:us

Seryoger/runpod-endpoint-cache

Author: Seryoger

Likes: 1

Downloads: 0

Tags: region:us

Just999999/TextInterpreter


license: apache-2.0

Author: Just999999

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us

Just999999/PromptCore


license: apache-2.0

Author: Just999999

Likes: 1

Downloads: 0

Tags: license:apache-2.0, region:us