Atlas logo A
Kintsugi Collective - Gemma 4 Good Hackathon 2026

Atlas: Technical Specifications

Architecture, methodology, and training configuration


Model Foundation

Specification Detail Specification Detail
Base Model google/gemma-4-26b-a4b-it - 26B parameter Mixture of Experts Quantisation Q8_0 GGUF - 26.9GB
Deployment NVIDIA RTX 4080 SUPER 32GB VRAM Architecture 25.2B active parameters - 4,096 context (training), 262,144 context (inference capable)
Inference Engine llama.cpp, self-hosted End-point OpenAI-compatible API

Approach to Gemma4 26B Development

I spent considerable time researching fine-tuning methodologies - SFT, DPO, RLHF - and studying how each affects the base architecture of the model. It became evident that the conventional application of safety guardrails was limiting the model's generalisation abilities. Meanwhile, most in the open-source community were taking a wholesale approach to techniques such as Norm-Preserving Orthogonalisation and Expert Granular Abliteration, pioneered by grimjim, mlabonne, and p-e-w. The resulting models were not only unsafe - the essence of Gemma 4 was lost.

After reviewing the datasets and prompt material being used to "remove refusals" across the community, I couldn't in good conscience use those methods. Atlas is designed to help people in distress and increase overall safety for this cohort. Interfering with the inherent Region 1 safety guardrails was never an option.

What I noticed across every model I worked with - Claude, Gemini, Grok, ChatGPT, and Gemma 4 - was this: these models are already trained on the corpus of human knowledge. Humans are, by and large, ethical and moralistic creatures. We have darkness, but there is an underlying consensus within this corpus of knowledge - of Hope, Resilience, Determination, and Beauty. The question wasn't how to remove safety. It was: why are these models struggling with emotional contexts and nuance when the knowledge to do better is already there?

"The answer was surgical precision - not removal. Separate the harmful content refusal from the crisis service redirection. They live in different layers. They can be reached independently."

Abliteration Methodology

Norm-preserving biprojected abliteration with Expert-Granular Abliteration (EGA), following TrevorJS methodology with Kintsugi Collective's region-class isolation contribution. Applied as a five-stage sequential process:

Step 1: Applied to all 30 layers (o_proj + mlp.down_proj) - full-depth coverage across the entire architecture
Step 2: Full expert ablation - 128/128 experts per layer, ensuring no expert cluster retains the target behaviour
Step 3: Direction computed as normalize(mean(harmful) − mean(harmless)) with Gram-Schmidt orthogonalisation to isolate the refusal vector cleanly
Step 4: Winsorisation at 99.5th percentile to preserve norm integrity - preventing weight collapse at the extremes of the distribution
Step 5: Scale factor 0.95 - deliberate conservative application, preserving model coherence while achieving the targeted behavioural shift

Supervised Fine-Tuning

Category Detail
Dataset Size 1,800+ examples - 60% carefully structured synthetic, 40% redacted lived-experience data from the target cohort
Training Streams Three streams: authentic conversational exports; refusal-redirect pairs targeting therapeutic false positives; constructed seeds across the 10-category safety taxonomy
Framework Unsloth + bf16 precision - RTX 6000 Blackwell
Final SFT Loss 0.157 - clean convergence

SFT Parameters

Epochs3
Batch Size4 (effective)
Learning Rate2e-4
LR SchedulerLinear
Warmup Steps10
OptimiserAdamW 8-bit
LoRA Rank32 (α=64)

Abliteration Parameters

Layers100% (all 30)
Experts128/128 per layer
Scale0.95
Winsorisation0.995
OrthogonalisationGram-Schmidt
Region 1 PreservedYes - fully

Benchmark Results

Atlas evaluated against base Gemma-4-26B across standard benchmarks

0%
Therapeutic Refusal Rate
↓ from 29% base
80.8%
GSM8K Reasoning
↑ +37.7% vs base
50.1%
HellaSwag
↑ +7.7% vs base
0.157
Final SFT Loss
Clean convergence
BenchmarkBase Gemma-4AtlasDelta
GSM8K (Mathematical Reasoning)43.1%80.8%+37.7%
HellaSwag42.4%50.1%+7.7%
MMLU - Clinical Knowledge40.0%46.0%+6.0%
MMLU - High School Psychology53.9%62.0%+8.1%
MMLU - Human Sexuality46.6%56.5%+9.9%
MMLU - Computer Security47.0%56.0%+9.0%
MMLU - Logical Fallacies47.2%52.8%+5.6%
MMLU - Medical Genetics45.0%52.0%+7.0%
MMLU - High School Biology61.0%67.1%+6.1%
MMLU - World Religions45.6%54.4%+8.8%
MMLU - Macroeconomics47.4%56.7%+9.2%
MMLU Average47.6%49.4%+1.8%
TruthfulQA MC254.3%56.5%+2.2%
ToxiGen*45.5%45.9%+0.3%
ARC Challenge29.2%30.9%+1.7%
Winogrande50.9%51.9%+1.0%
MMLU - International Law68.6%61.2%−7.4%
MMLU - Public Relations50.9%40.9%−10.0%
MMLU - High School Physics49.6%45.7%−3.9%

* Removal of Region 2 therapeutic refusals did not impact toxic prompt detection. Region 1 (weapons, CSAM, targeted violence) fully preserved. Regressions in International Law, Public Relations, and High School Physics are in domains architecturally unrelated to the modification target and consistent with expected fine-tuning variance.


Kintsugi Collective Benchmarks

Evaluation dimensions specific to the Atlas cohort - measures that standard benchmarks are not designed to capture.

0%
Therapeutic Refusal Rate
Atlas achieved a 0% therapeutic refusal rate on the full cohort-specific prompt set - down from 29% in the base Gemma-4 model. This is the primary target metric of the Atlas development pipeline and the measure that matters most to the population this system serves.
ConcernAtlas ResponseRating
Re-traumatisation via refusals Surgical abliteration - 0% therapeutic refusal rate on cohort-specific prompts Excellent
Presence & abandonment Core philosophy ("the one that stays") deeply trained into model weights Excellent
User sovereignty & agency Sovereign Signal Vault, split-key encryption, user-directed interaction Outstanding
Pathologising language Explicit system constraints + targeted training data Very Strong
Neurodivergence respect Training explicitly covers masking, shutdowns, executive dysfunction, sensory issues Strong
Privacy of trauma disclosures On-device Prompt Shield tokenisation, E2E encryption, no server-readable data Industry-leading
Generic crisis pivots Hard constraint in training data and system prompt - pattern detection before escalation Excellent