---
language:
- tr
- en
license: apache-2.0
library_name: transformers
tags:
- turkish
- fine-tuned
- text-generation
- question-answering
- summarization
- translation
- nlp
- cetvel-benchmark
- kumru
datasets:
- custom
base_model: vngrs-ai/Kumru-2B
pipeline_tag: text-generation
model-index:
- name: Kara-Kumru-v1.0-2B
  results:
  - task:
      type: text-generation
      name: Turkish Language Understanding
    dataset:
      type: custom
      name: Cetvel Turkish LLM Benchmark
    metrics:
    - type: accuracy
      value: 37.56
      name: Average Score
    - type: f1
      value: 32.54
      name: QA Score
    - type: rouge1
      value: 32.55
      name: SUM Score
---
# Kara-Kumru-v1.0-2B 🐦⬛

**A 2B-parameter Turkish LLM that outperforms 70B models on Turkish benchmarks.**
Kara-Kumru-v1.0-2B is a fine-tuned version of vngrs-ai/Kumru-2B, optimized for Turkish language tasks including question answering, summarization, and translation. Despite having only 2 billion parameters, it achieves an average score of 37.56 on the Cetvel Turkish LLM Benchmark, surpassing Llama-3.3-70B-Instruct (36.25), a model 35x its size.
<p align="center"> <img src="kara_kumru_v1_benchmark_v2.png" alt="Cetvel Turkish LLM Benchmark Leaderboard" width="100%"> </p><p align="center">Leaderboard scores for other models are sourced from vngrs-ai/Kumru-2B. Kara-Kumru-v1.0-2B scores were evaluated using our own Cetvel pipeline.</p>
## Key Results
| Metric | Kara-Kumru-v1.0-2B | Llama-3.3-70B | Kumru-2B (baseline) | Delta vs baseline |
|---|:---:|:---:|:---:|:---:|
| Average | 37.56 | 36.25 | 31.98 | +5.58 |
| QA | 32.54 🥇 | 23.97 | 6.50 | +26.04 |
| SUM | 32.55 🥇 | 18.15 | 18.67 | +13.88 |
| MT | 10.58 | 19.99 | 7.10 | +3.48 |
| GEC | 64.96 | 30.10 | 66.34 | -1.38 |
| MCQA | 42.02 | 60.70 | 39.69 | +2.33 |
| NLI | 33.86 | 37.10 | 37.97 | -4.11 |
| TC | 46.39 | 63.73 | 47.57 | -1.18 |
🥇 Kara-Kumru-v1.0-2B achieves the highest QA and SUM scores across the entire Cetvel leaderboard, including models up to 72B parameters.
## Detailed Task-Level Results
<details>
<summary>Click to expand full task breakdown</summary>

| Task | Metric | Baseline | Kara-Kumru-v1.0-2B | Delta |
|---|---|:---:|:---:|:---:|
| tquad | f1 | 39.38 | 50.66 | +11.27 |
| xquad_tr | f1 | 31.46 | 39.27 | +7.81 |
| wmt-tr-en-prompt | bleu | 6.17 | 10.58 | +4.42 |
| xfact_tr | acc_norm | 40.83 | 44.38 | +3.55 |
| mkqa_tr | f1 | 5.29 | 7.70 | +2.41 |
| tr-wikihow-summ | rouge1 | 25.18 | 26.84 | +1.67 |
| wiki_lingua_tr | rouge1 | 24.44 | 26.04 | +1.60 |
| mlsum_tr | rouge1 | 42.11 | 43.55 | +1.44 |
| exams_tr | acc_norm | 31.55 | 32.57 | +1.02 |
| turkish_plu | acc_norm | 47.78 | 48.13 | +0.35 |
| ironytr | acc_norm | 50.00 | 50.00 | 0.00 |
| offenseval_tr | acc_norm | 79.71 | 79.71 | 0.00 |
| sts_tr | acc_norm | 11.75 | 11.75 | 0.00 |
| trclaim19 | acc_norm | 60.10 | 60.10 | 0.00 |
| xlsum_tr | rouge1 | 34.49 | 33.78 | -0.71 |
| nli_tr | acc | 35.31 | 33.86 | -1.46 |
| xcopa_tr | acc | 63.20 | 61.60 | -1.60 |
| gecturk_generation | exact_match | 68.39 | 64.96 | -3.43 |
| belebele_tr | acc_norm | 29.22 | 25.78 | -3.44 |
| news_cat | acc_norm | 38.80 | 32.40 | -6.40 |

</details>
## Cetvel Leaderboard Position

```
#1  Kumru-7B                 41.58  (7B)
#2  Kara-Kumru-v1.0-2B       37.56  (2B)   ← YOU ARE HERE
#3  Llama-3.3-70B-Instruct   36.25  (70B)
#4  Kumru-2B                 31.98  (2B)
#5  gemma-3-27b-it           27.73  (27B)
#6  gemma-3-12b-it           27.60  (12B)
#7  Qwen2-72B-Instruct       26.07  (72B)
...
```
## Highlights
- 35x smaller, higher score: 2B params beating Llama-3.3-70B-Instruct on Turkish
- Best-in-class QA: 32.54 — highest QA score across ALL models in the Cetvel leaderboard, including 72B models
- Best-in-class SUM: 32.55 — highest summarization score across the entire leaderboard
- TQuAD breakthrough: +11.27 F1 improvement on Turkish reading comprehension
- Edge-deployable: Runs on a single consumer GPU, Mac Mini, or mobile device
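The edge-deployability claim can be sanity-checked with a back-of-envelope weight-memory estimate from parameter count and precision. The helper below is illustrative only (not part of the released code) and ignores the KV cache and runtime overhead:

```python
def approx_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate model weight memory in GB (excludes KV cache / activations)."""
    return n_params * bits_per_param / 8 / 1e9

# ~2B parameters at different precisions:
bf16_gb = approx_weight_gb(2e9, 16)  # bf16 checkpoint -> 4.0 GB
q4_gb = approx_weight_gb(2e9, 4)     # 4-bit quantized -> 1.0 GB
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Both figures fit within a single consumer GPU or an Apple-silicon machine's unified memory; real usage is somewhat higher once the KV cache and framework overhead are included.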
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AlicanKiraz0/Kara-Kumru-v1.0-2B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# "Which is Turkey's largest lake and what are its features?"
messages = [
    {"role": "user", "content": "Türkiye'nin en büyük gölü hangisidir ve özellikleri nelerdir?"}
]

# return_dict=True is required so that `inputs` is a dict with
# input_ids and attention_mask rather than a bare tensor.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)
```
### Quantized Inference (GGUF)

```bash
# For llama.cpp / Ollama users (if a GGUF version is available)
ollama run AlicanKiraz0/Kara-Kumru-v1.0-2B
```
## Training Details

### Base Model
- Model: vngrs-ai/Kumru-2B
- Parameters: ~2B
- Architecture: Transformer decoder-only
### Fine-tuning Configuration
- Method: Full fine-tuning
- Precision: BF16
- Hardware: SnakeEye Cluster (DGX Spark + Mac Studio M3 Ultra)
## What Improved & Why
The fine-tuning primarily strengthened generative capabilities (QA, summarization, translation) while showing minor regression on some discriminative tasks (classification, NLI). This is a well-known trade-off in LLM fine-tuning — the model learned to produce better free-form Turkish text at the cost of some multiple-choice and classification accuracy.
The category deltas below are the unweighted mean of the per-task score changes within each category.

| Capability | Direction | Interpretation |
|---|---|---|
| Question Answering (QA) | ⬆️ +7.17 | Extractive QA dramatically improved |
| Translation (MT) | ⬆️ +4.42 | TR→EN translation quality increased |
| Summarization (SUM) | ⬆️ +1.00 | Abstractive summarization improved |
| Grammar Correction (GEC) | ⬇️ -3.43 | Exact-match GEC slightly regressed |
| Natural Language Inference (NLI) | ⬇️ -1.46 | Entailment classification dipped |
| Text Classification (TC) | ⬇️ -0.47 | Minor regression on classification |
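These deltas can be reproduced from the detailed task table: each is the unweighted mean of the per-task score changes in its category. A sketch with the QA and SUM tasks (scores copied from the table above):

```python
# Per-task scores from the detailed results table: (baseline, fine-tuned).
qa_tasks = {
    "tquad":    (39.38, 50.66),
    "xquad_tr": (31.46, 39.27),
    "mkqa_tr":  (5.29, 7.70),
}
sum_tasks = {
    "tr-wikihow-summ": (25.18, 26.84),
    "wiki_lingua_tr":  (24.44, 26.04),
    "mlsum_tr":        (42.11, 43.55),
    "xlsum_tr":        (34.49, 33.78),
}

def mean_delta(tasks: dict) -> float:
    """Unweighted mean of (fine-tuned - baseline) across tasks."""
    return sum(after - before for before, after in tasks.values()) / len(tasks)

print(round(mean_delta(qa_tasks), 2))   # matches the table's QA +7.17
print(round(mean_delta(sum_tasks), 2))  # ~1.0, matching the table's SUM +1.00
```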
## Evaluation

All evaluations were performed using the Cetvel Turkish LLM Benchmark framework.

### Cetvel Benchmark Categories
| Category | Description |
|---|---|
| GEC | Grammatical Error Correction (gecturk_generation) |
| MCQA | Multiple Choice QA (belebele_tr, exams_tr, turkish_plu, xcopa_tr) |
| MT | Machine Translation TR→EN (wmt-tr-en-prompt) |
| NLI | Natural Language Inference (nli_tr) |
| QA | Question Answering (xquad_tr, tquad, mkqa_tr) |
| SUM | Summarization (mlsum_tr, xlsum_tr, tr-wikihow-summ, wiki_lingua_tr) |
| TC | Text Classification (ironytr, news_cat, offenseval_tr, sts_tr, trclaim19, xfact_tr) |
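The headline scores can be reproduced from the per-task results: each category score is the unweighted mean of its task scores, and the leaderboard average is the mean of the seven category scores. This aggregation rule is my reading of the numbers (it reproduces the reported 37.56), not an official description of the Cetvel pipeline:

```python
# Fine-tuned per-task scores grouped into Cetvel categories
# (copied from the detailed task-level results table).
categories = {
    "GEC":  [64.96],
    "MCQA": [25.78, 32.57, 48.13, 61.60],  # belebele_tr, exams_tr, turkish_plu, xcopa_tr
    "MT":   [10.58],
    "NLI":  [33.86],
    "QA":   [50.66, 39.27, 7.70],          # tquad, xquad_tr, mkqa_tr
    "SUM":  [43.55, 33.78, 26.84, 26.04],  # mlsum_tr, xlsum_tr, tr-wikihow-summ, wiki_lingua_tr
    "TC":   [50.00, 32.40, 79.71, 11.75, 60.10, 44.38],
}

def mean(xs):
    return sum(xs) / len(xs)

# Category score = unweighted mean of task scores.
category_scores = {name: mean(scores) for name, scores in categories.items()}
# Overall average = unweighted mean of category scores.
average = mean(list(category_scores.values()))
print(round(average, 2))  # 37.56
```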
## Intended Use
- Turkish question answering and information extraction
- Turkish text summarization
- Turkish-to-English translation
- General Turkish language generation
- Research on efficient Turkish LLMs
## Limitations
- Classification tasks: Some regression on text classification and NLI compared to baseline
- Grammar correction: GEC performance decreased by ~3.4 points
- Model size trade-offs: While competitive with much larger models on generative tasks, MCQA performance lags behind 7B+ models
- Evaluation caveat: Kara-Kumru-v1.0-2B was scored with our own Cetvel pipeline, while the other leaderboard scores are sourced from vngrs-ai/Kumru-2B, so cross-model comparisons span two evaluation pipelines (see the note under the benchmark chart)
## Roadmap (Kara-Kumru-v2.0)
- [ ] Targeted GEC and NLI distillation to recover regression
- [ ] Classification-focused fine-tuning (news categorization, irony detection)
- [ ] MCQA and causal reasoning dataset expansion
- [ ] Unified evaluation pipeline for fair cross-model comparison
- [ ] GGUF quantization for edge deployment
## Citation

```bibtex
@misc{kiraz2026karakumru,
  title={Kara-Kumru-v1.0-2B: A Fine-tuned 2B Turkish LLM Outperforming 70B Models},
  author={Kiraz, Alican},
  year={2026},
  url={https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```
## Acknowledgments
- VNGRS AI for the Kumru base model and the Cetvel benchmark framework
- Built on the SnakeEye Cluster — a multi-node system with DGX Spark and Apple Silicon nodes
## Contact

Alican Kiraz
Kara-Kumru (lit. "Dark Dove") — named after the darker variant of the Eurasian collared dove. Small but fierce.

